How to calculate precision, recall, and F1 score in Python
Learn how to calculate precision, recall, and F1 score in Python. Explore different methods, real-world applications, and debugging tips.
Precision, recall, and F1 score are key metrics for machine learning model evaluation. They offer deeper insights than accuracy alone, particularly for imbalanced datasets. Python provides simple tools for their calculation.
In this article, you'll learn to calculate these metrics in Python. You'll explore different techniques, get practical tips, see real-world applications, and receive advice to debug common issues you might encounter.
Basic calculation using sklearn.metrics
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")

Output: Precision: 1.00, Recall: 0.50, F1: 0.67
The sklearn.metrics module offers a straightforward way to evaluate your model. The code uses two lists: y_true for the correct, ground-truth labels and y_pred for your model's predictions. By passing these lists to the precision_score, recall_score, and f1_score functions, you can quickly calculate each metric.
In this case, the precision is 1.00, which means every time the model predicted a positive case, it was right. The recall of 0.50, however, shows it only identified half of all actual positive cases. The F1 score balances these two, giving you a single performance measure.
Alternative implementation methods
While these functions are handy for a quick check, you'll often need a more granular view or a way to handle more complex classification tasks.
Manual calculation from confusion matrix
from sklearn.metrics import confusion_matrix
import numpy as np
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")

Output: Precision: 1.00, Recall: 0.50, F1: 0.67
Manually calculating these metrics offers a look under the hood. The process starts with confusion_matrix, which tallies up your model's correct and incorrect predictions. The .ravel() function then unpacks these results into four key values:
- tp (True Positives): Correctly identified positives.
- fp (False Positives): Negatives incorrectly labeled as positive.
- fn (False Negatives): Positives the model missed.
- tn (True Negatives): Correctly identified negatives.
With these components, you can directly apply the standard formulas to find the precision, recall, and F1 score yourself.
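As a sanity check, the same formulas can be implemented in plain Python without scikit-learn (a minimal sketch for binary 0/1 labels):

```python
def binary_metrics(y_true, y_pred):
    # Count true positives, false positives, and false negatives by hand
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators instead of dividing by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(binary_metrics([0, 1, 1, 0, 1, 1], [0, 0, 1, 0, 0, 1]))
```

Running this on the same lists as above reproduces the sklearn results: precision 1.0, recall 0.5, F1 about 0.67.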
Using classification_report for a comprehensive view
from sklearn.metrics import classification_report
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]
report = classification_report(y_true, y_pred)
print(report)

Output:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       1.00      0.50      0.67         4

    accuracy                           0.67         6
   macro avg       0.75      0.75      0.67         6
weighted avg       0.83      0.67      0.67         6
For a complete picture of your model's performance, the classification_report function is your best tool. It bundles precision, recall, and F1 score into a single, easy to read report. Its main advantage is breaking down metrics for each class, showing you how well the model handles each category individually.
The report also includes:
- support: The number of actual occurrences of each class in your data.
- macro avg: The unweighted average of each metric across all classes.
- weighted avg: The average score, weighted by the support for each class.
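You can reproduce the macro and weighted averages yourself from the per-class scores (a quick sketch using precision_recall_fscore_support and NumPy):

```python
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# Per-class precision, recall, F1, and support arrays
p, r, f, support = precision_recall_fscore_support(y_true, y_pred)

macro_p = p.mean()                           # unweighted mean across classes
weighted_p = np.average(p, weights=support)  # weighted by class support

print(f"Macro precision: {macro_p:.2f}, Weighted precision: {weighted_p:.2f}")
```

These hand-rolled averages match the macro avg and weighted avg rows that classification_report prints for the same data.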
Calculating metrics for multi-class problems
from sklearn.metrics import precision_recall_fscore_support
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2]
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None)
print(f"Class-wise Precision: {precision}")
print(f"Class-wise Recall: {recall}")
print(f"Class-wise F1: {f1}")

Output:
Class-wise Precision: [0.66666667 0.         0.5       ]
Class-wise Recall: [1.  0.  0.5]
Class-wise F1: [0.8 0.  0.5]
When you're dealing with more than two categories, you need a way to see how your model performs on each one. The precision_recall_fscore_support function is perfect for this. By setting average=None, you tell the function to return the metrics for each class separately instead of a single, averaged score.
- The function returns separate arrays for precision, recall, and F1 score.
- Each value in an array corresponds to a class, giving you a detailed performance breakdown.
Advanced applications and techniques
Beyond these fundamental calculations, you can gain deeper insights into your model's stability and trade-offs with more advanced evaluation techniques.
Cross-validation with metric scoring
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, random_state=42)
clf = SVC(kernel='linear', random_state=42)
cv_precision = cross_val_score(clf, X, y, cv=5, scoring='precision')
cv_recall = cross_val_score(clf, X, y, cv=5, scoring='recall')
cv_f1 = cross_val_score(clf, X, y, cv=5, scoring='f1')
print(f"CV Precision: {cv_precision.mean():.2f}, CV Recall: {cv_recall.mean():.2f}, CV F1: {cv_f1.mean():.2f}")

Output: CV Precision: 0.87, CV Recall: 0.91, CV F1: 0.89
Cross-validation gives you a more reliable measure of your model's performance. Instead of relying on a single train-test split, cross_val_score splits your data into multiple "folds"—in this case, five. It then trains and tests the model five times, using a different fold for testing each time.
- The scoring parameter lets you specify which metric to calculate, such as 'precision' or 'recall'.
- The function returns an array of scores, one for each fold.
- Taking the .mean() of these scores gives you a more stable and trustworthy evaluation.
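Calling cross_val_score once per metric refits the model for each one. If that's a concern, scikit-learn's cross_validate accepts a list of scorers and evaluates all of them in a single pass over the folds:

```python
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=42)
clf = SVC(kernel='linear', random_state=42)

# One call fits each fold once and scores it with all three metrics
results = cross_validate(clf, X, y, cv=5,
                         scoring=['precision', 'recall', 'f1'])

for metric in ['precision', 'recall', 'f1']:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

The results dictionary holds one array per scorer under keys like 'test_precision', plus fit and score timings for each fold.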
Custom metrics with different averaging methods
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2]
# Different averaging methods
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')
micro_f1 = f1_score(y_true, y_pred, average='micro')
print(f"Macro F1: {macro_f1:.2f}, Weighted F1: {weighted_f1:.2f}, Micro F1: {micro_f1:.2f}")

Output: Macro F1: 0.43, Weighted F1: 0.43, Micro F1: 0.50
The average parameter in functions like f1_score is key for summarizing multi-class performance into a single number. Your choice of averaging method depends on what you want to emphasize in your evaluation.
- 'macro': Calculates the metric for each class and finds the unweighted mean. This treats every class as equally important, which can be misleading if you have a class imbalance.
- 'weighted': Averages the per-class scores, but weights them by the number of true instances for each class. This better reflects performance on imbalanced datasets.
- 'micro': Calculates the metric globally by counting the total true positives, false negatives, and false positives. It gives each sample equal importance.
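A useful property to remember: for single-label multi-class problems, micro-averaged precision, recall, and F1 all equal plain accuracy, because every false positive for one class is simultaneously a false negative for another. A quick check with the same data:

```python
from sklearn.metrics import f1_score, accuracy_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2]

micro_f1 = f1_score(y_true, y_pred, average='micro')
accuracy = accuracy_score(y_true, y_pred)

# Micro F1 equals accuracy when each sample has exactly one label
print(f"Micro F1: {micro_f1:.2f}, Accuracy: {accuracy:.2f}")
```

Because of this equivalence, micro averaging adds little information in single-label settings; it earns its keep in multi-label classification.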
Visualizing precision-recall curves
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
A curve visualizing precision and recall shows the trade-off between these metrics at different classification thresholds. This helps you see how your model behaves as you change its sensitivity. The code uses the precision_recall_curve function to generate the data needed for this plot.
- It gets the model's prediction probabilities using predict_proba(), which are confidence scores rather than final labels.
- The function then returns arrays of precision and recall values that you can plot to visualize their relationship.
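One practical use of these arrays is choosing the decision threshold that maximizes F1 instead of accepting the default 0.5 cutoff. A self-contained sketch, using the same synthetic setup as above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
y_scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# F1 at every candidate threshold; the final precision/recall pair
# has no matching threshold, so it is dropped with [:-1]
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"Best threshold: {thresholds[best]:.2f}, F1 there: {f1[best]:.2f}")
```

Note that precision_recall_curve returns one more precision/recall value than thresholds, which is why the arrays are trimmed before computing F1.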
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the metric calculation techniques covered in this article, Replit Agent can turn them into production-ready tools:
- Build a dashboard that visualizes the trade-offs from a precision_recall_curve for different model thresholds.
- Create a model evaluation utility that generates a full classification_report from user-provided prediction data.
- Deploy a tool that compares multi-class model performance using 'macro', 'weighted', and 'micro' averaging.
Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically. Try Replit Agent to bring your concepts to life.
Common errors and challenges
You might run into a few common errors when calculating these metrics, but they're simple to fix once you know what's happening.
Handling ZeroDivisionError in precision_score and recall_score
A zero denominator is a common hiccup when your model fails to predict any positive instances. For precision_score, this happens when the total number of predicted positives (tp + fp) is zero. For recall_score, it occurs when there are no actual positive instances in your data, making the denominator (tp + fn) zero. In these cases, scikit-learn returns 0 for the affected metric and emits an UndefinedMetricWarning rather than raising a ZeroDivisionError outright.
- You can handle this gracefully by setting the zero_division parameter in the scikit-learn function. Setting zero_division=0 returns a score of 0 for that metric without the warning, which is the most common approach.
Specifying correct labels parameter for multi-class metrics
In multi-class classification, if your model doesn't predict a certain class at all, that class might be missing from your report. This can give you an incomplete picture of performance. To avoid this, you should use the labels parameter in functions like classification_report or precision_recall_fscore_support.
- By providing a list of all expected class labels—for example, labels=[0, 1, 2]—you ensure that every class is included in the output, even if its precision and recall are zero.
Using appropriate metrics for imbalanced datasets
Relying on accuracy alone is a classic pitfall with imbalanced datasets. If 95% of your samples belong to one class, a model can achieve 95% accuracy by simply always predicting that majority class, making it useless for identifying the minority class. This is why precision and recall are so critical.
- These metrics help you understand how well the model performs on the less frequent class. The weighted avg in a classification_report is especially valuable here, as it accounts for the imbalance when calculating the average scores.
Handling ZeroDivisionError in precision_score and recall_score
When you use precision_score or recall_score with no positive samples, the denominator in the metric's formula is zero. Rather than raising a ZeroDivisionError, scikit-learn returns 0.0 and emits an UndefinedMetricWarning, which can clutter your logs and mask a real problem. The following code triggers this warning.
from sklearn.metrics import precision_score, recall_score
y_true = [0, 0, 0, 0] # No positive samples
y_pred = [0, 0, 0, 0] # No positive predictions
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision}, Recall: {recall}")
With no positive samples in y_true and no positive predictions in y_pred, the denominators for both the precision and recall formulas become zero, triggering the warning. The following example shows how to manage this scenario.
from sklearn.metrics import precision_score, recall_score
y_true = [0, 0, 0, 0] # No positive samples
y_pred = [0, 0, 0, 0] # No positive predictions
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
print(f"Precision: {precision}, Recall: {recall}")
By setting the zero_division=0 parameter in precision_score and recall_score, you instruct the functions to return 0 silently instead of emitting the warning. This is a clean way to handle cases where the denominator for either metric is zero, which happens when there are no positive predictions or no actual positive labels in the data. It's a crucial adjustment for making your evaluation scripts more robust, especially when working with small or filtered datasets.
Specifying correct labels parameter for multi-class metrics
When a model fails to predict a class, metrics calculated with f1_score can be misleading. Scikit-learn may exclude the unpredicted class from its calculation, skewing the average. The following code sets up a scenario where this can easily happen.
from sklearn.metrics import f1_score
y_true = [1, 2, 3, 1, 2, 3] # Classes 1, 2, 3
y_pred = [1, 3, 2, 1, 1, 3]
f1 = f1_score(y_true, y_pred, average='macro')
print(f"F1 score: {f1:.2f}")
The f1_score function calculates its average based only on the labels it finds in y_true and y_pred. Here every class happens to appear, but if a class were absent from both arrays, it would be silently dropped from the average. The next example shows how to ensure every class is counted.
from sklearn.metrics import f1_score
y_true = [1, 2, 3, 1, 2, 3] # Classes 1, 2, 3
y_pred = [1, 3, 2, 1, 1, 3]
f1 = f1_score(y_true, y_pred, average='macro', labels=[1, 2, 3])
print(f"F1 score: {f1:.2f}")
By passing labels=[1, 2, 3] to the f1_score function, you force it to account for every class, even those the model never predicts. In this example the result is unchanged because all three classes appear in the data, but in production, where a model can miss a class entirely, the explicit list prevents an inflated score that ignores those failures. You should use this parameter in multi-class scenarios to ensure your evaluation is complete and doesn't hide any blind spots in your model's predictions.
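The labels parameter matters most when a class in your label space never appears in the sample at all. In the sketch below, class 2 belongs to the expected label set but is absent from both y_true and y_pred; without labels it is silently ignored, while passing labels pulls the macro average down to reflect it:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 0, 1]  # class 2 never appears in this sample
y_pred = [0, 1, 0, 1]

# Without labels, sklearn averages over classes 0 and 1 only
without = f1_score(y_true, y_pred, average='macro')

# With labels, the absent class 2 contributes an F1 of 0
with_labels = f1_score(y_true, y_pred, average='macro',
                       labels=[0, 1, 2], zero_division=0)

print(f"Without labels: {without:.2f}, With labels=[0, 1, 2]: {with_labels:.2f}")
```

The score drops from a perfect 1.00 to 0.67 once the missing class is counted, which is exactly the blind spot the parameter exposes.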
Using appropriate metrics for imbalanced datasets
Relying on accuracy with imbalanced datasets is a classic pitfall. When one class dominates, a model can achieve a high score by only predicting that majority class, rendering it useless. The following code shows how this can create a misleadingly high accuracy.
from sklearn.metrics import accuracy_score
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] # Imbalanced dataset
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # Model predicts majority class
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
The model's predictions consist entirely of the majority class 0, leading to a 90% accuracy_score. This high score masks its complete failure to identify the minority class. The following example offers a more insightful evaluation.
from sklearn.metrics import balanced_accuracy_score, f1_score
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] # Imbalanced dataset
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # Model predicts majority class
bal_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
print(f"Balanced accuracy: {bal_acc:.2f}, Weighted F1: {f1:.2f}")
The high accuracy score was deceptive. To get a truer picture, you’ll want to use metrics that account for the imbalance. This approach reveals the model's weakness in identifying the minority class.
- balanced_accuracy_score calculates the average recall for each class, so the majority class can't dominate the score.
- A weighted f1_score also adjusts for imbalance, giving you a more honest evaluation of performance across all classes.
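You can confirm that balanced_accuracy_score is just the unweighted mean of per-class recalls, since recall_score with average='macro' computes the same quantity:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Macro-averaged recall is the mean of per-class recalls: (1.0 + 0.0) / 2
macro_recall = recall_score(y_true, y_pred, average='macro')
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"Macro recall: {macro_recall:.2f}, Balanced accuracy: {bal_acc:.2f}")
```

Both come out to 0.50 here: perfect recall on the majority class averaged with zero recall on the minority class, which is a far more honest summary than the 90% accuracy.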
Real-world applications
These metrics move from theory to practice when evaluating critical systems like spam filters and fraud detection models.
Evaluating a spam filter with precision_score and recall_score
For a spam filter, precision_score and recall_score help you measure the critical balance between blocking unwanted emails and ensuring legitimate ones reach your inbox.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score
messages = ["Free prize", "Meeting at 5", "Win now", "Call me"]
labels = [1, 0, 1, 0] # 1 for spam, 0 for ham
X = CountVectorizer().fit_transform(messages)
model = MultinomialNB().fit(X, labels)
predictions = model.predict(X)
print(f"Spam detection - Precision: {precision_score(labels, predictions):.2f}")
print(f"Spam detection - Recall: {recall_score(labels, predictions):.2f}")
This example builds a basic spam detector to show how precision and recall are used in a real scenario. The process starts by converting text into numbers the model can understand.
- The CountVectorizer transforms the list of messages into a numerical feature matrix.
- A MultinomialNB model is then trained on this data to learn the patterns of spam versus non-spam messages.
Finally, precision_score and recall_score evaluate how well the model's predictions match the true labels, giving you a clear measure of its effectiveness.
Comparing metrics for credit card fraud detection
In a high-stakes field like fraud detection, you must balance catching every fraudulent transaction with avoiding false alarms, making precision and recall essential for measuring this trade-off.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
# Simulate imbalanced fraud data
X, y = make_classification(n_samples=1000, weights=[0.97, 0.03], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Fraud detection - Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Fraud detection - Recall: {recall_score(y_test, y_pred):.2f}")
print(f"Fraud detection - F1: {f1_score(y_test, y_pred):.2f}")
This code shows how to evaluate a model on a simulated, imbalanced dataset. The make_classification function generates data where the minority class—in this case, fraud—makes up only 3% of the samples. This mimics a common real-world scenario.
- A RandomForestClassifier is trained on a portion of this data.
- Its performance is then measured on the test set using precision_score, recall_score, and f1_score.
These metrics are vital for imbalanced problems because they reveal how well the model identifies rare positive cases, offering more insight than accuracy alone.
Get started with Replit
Turn what you've learned into a real tool. Describe your idea to Replit Agent, like “build a dashboard that visualizes a precision-recall curve” or “create a utility that generates a classification_report from prediction data.”
The agent writes the code, tests for errors, and deploys your application. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.