How to calculate cosine similarity in Python

Learn how to calculate cosine similarity in Python. Discover different methods, real-world applications, and tips for debugging common errors.

Published on: Tue, Mar 10, 2026
Updated on: Fri, Mar 13, 2026
The Replit Team

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, capturing their orientation regardless of magnitude. It's a key metric for comparing document similarity in natural language processing and recommendation systems.

In this article, you'll explore techniques to calculate it in Python, with practical tips and real-world applications. You'll also get advice to debug common issues and master this concept.

Basic calculation with numpy

import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])
cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
print(f"Cosine similarity: {cosine_similarity:.4f}")

Output: Cosine similarity: 0.9746

This calculation relies on two key numpy functions to implement the cosine similarity formula efficiently. It's a straightforward translation of the mathematical concept into code.

  • np.dot(vector_a, vector_b) computes the dot product, which forms the numerator.
  • np.linalg.norm() calculates the magnitude, or length, of each vector for the denominator.

Dividing the dot product by the product of the magnitudes gives you a score reflecting the vectors' orientation, not their size. The result of 0.9746 indicates the vectors are very closely aligned.
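Because the score depends only on direction, its range is fixed: vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1. A quick sketch to build intuition (the helper name cos_sim is chosen here for illustration):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product over the product of magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
print(cos_sim(v, 2 * v))                                    # same direction: ~1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal: 0.0
print(cos_sim(v, -v))                                       # opposite direction: ~-1.0
```

Note that scaling a vector (here, doubling it) doesn't change its score, which is exactly what "orientation, not size" means in practice.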

Common implementation methods

While the numpy method is a solid starting point, Python's ecosystem provides more specialized tools and the flexibility to implement the calculation from scratch.

Using scipy.spatial.distance

from scipy.spatial.distance import cosine

vector_a = [1, 2, 3]
vector_b = [4, 5, 6]
# scipy returns cosine distance, so we subtract from 1 to get similarity
cosine_similarity = 1 - cosine(vector_a, vector_b)
print(f"Cosine similarity: {cosine_similarity:.4f}")

Output: Cosine similarity: 0.9746

For a more specialized approach, you can use the cosine function from scipy.spatial.distance. This high-level utility handles the calculation for you, making the code cleaner and more direct.

  • The most important detail is that this function returns the cosine distance, not similarity. Distance measures how different vectors are, so a distance of 0 means the vectors point in exactly the same direction.
  • To get the similarity score, you just subtract the distance from 1, which is why the code uses 1 - cosine(vector_a, vector_b).

Implementing from scratch

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

print(f"Cosine similarity: {cosine_similarity([1, 2, 3], [4, 5, 6]):.4f}")

Output: Cosine similarity: 0.9746

Writing the function from scratch is a great way to understand the mechanics. It's a direct translation of the mathematical formula using standard Python features, offering full transparency without relying on external libraries.

  • The dot product is calculated using zip() to pair corresponding elements from both lists, which are then multiplied and summed.
  • Each vector's magnitude is found by squaring its elements, summing the results, and taking the square root with the ** 0.5 operator.

While this approach is perfect for learning, it's generally less performant than optimized library functions on large datasets.
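To see that gap for yourself, here's a rough benchmark sketch using timeit (the function names cosine_py and cosine_np are illustrative, and absolute timings vary by machine):

```python
import timeit
import numpy as np

def cosine_py(a, b):
    # Pure-Python version using generator expressions
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x * x for x in a) ** 0.5
    mag_b = sum(x * x for x in b) ** 0.5
    return dot / (mag_a * mag_b)

def cosine_np(a, b):
    # Vectorized numpy version
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
a = rng.random(10_000)
b = rng.random(10_000)
a_list, b_list = a.tolist(), b.tolist()

t_py = timeit.timeit(lambda: cosine_py(a_list, b_list), number=100)
t_np = timeit.timeit(lambda: cosine_np(a, b), number=100)
print(f"pure Python: {t_py:.3f}s, numpy: {t_np:.3f}s")
```

Both functions return the same value; the difference is that numpy runs the arithmetic in compiled loops rather than interpreting one element at a time.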

Using sklearn for vector comparison

from sklearn.metrics.pairwise import cosine_similarity

vector_a = [[1, 2, 3]] # sklearn expects 2D arrays
vector_b = [[4, 5, 6]]
similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine similarity: {similarity[0][0]:.4f}")

Output: Cosine similarity: 0.9746

When working in a machine learning pipeline, sklearn is a go-to choice. Its cosine_similarity function is built to efficiently compare entire collections of vectors at once, making it highly scalable.

  • This design is why it expects 2D arrays as input. You need to wrap single vectors in an outer list, such as [[1, 2, 3]].
  • The function returns a similarity matrix. For a simple two-vector comparison, this results in a 1x1 matrix, so you retrieve the value using similarity[0][0].
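That batch design pays off as soon as you have more than two vectors: a single call compares every query vector against every item vector. A small sketch (the items and queries arrays here are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three "item" vectors compared against two "query" vectors in one call
items = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
queries = np.array([[1, 0, 0], [0, 1, 1]])

# Result shape is (n_queries, n_items): one row of scores per query
matrix = cosine_similarity(queries, items)
print(matrix.round(4))
```

Each row of the result holds one query's similarity to all three items, so you never need an explicit Python loop over pairs.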

Advanced techniques

Building on the basic calculations, you can now tackle sophisticated tasks like comparing text, optimizing for sparse data, and working with multi-dimensional similarity.

Calculating similarity between text documents

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["Python is a programming language",
             "Python is used for data science"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print(f"Document similarity: {similarity[0][0]:.4f}")

Output: Document similarity: 0.2606

You can't directly compare text, so the first step is converting it into numerical vectors. The TfidfVectorizer handles this by creating a numerical representation of each document based on word frequency and importance.

  • The fit_transform method learns the vocabulary from your documents and converts them into a TF-IDF matrix.
  • With the text now represented as vectors, cosine_similarity can measure their orientation to determine how similar the documents are in meaning.

Optimizing for large datasets with sparse matrices

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Create sparse vectors
vector_a = csr_matrix([1, 0, 2, 0, 3])
vector_b = csr_matrix([0, 4, 5, 0, 6])
similarity = cosine_similarity(vector_a, vector_b)
print(f"Similarity with sparse vectors: {similarity[0][0]:.4f}")

Output: Similarity with sparse vectors: 0.8528

When dealing with large datasets, especially in text analysis, your vectors will often be "sparse"—meaning they're filled mostly with zeros. Storing all those zeros is inefficient, so it's better to use a format that only tracks non-zero values.

  • The csr_matrix from SciPy creates a compressed sparse row matrix, which saves significant memory by only storing the non-zero elements and their locations.
  • sklearn's cosine_similarity function is optimized to work directly with these sparse matrices, allowing for fast calculations without unpacking the data into a dense format.

This approach is essential for scaling your similarity computations efficiently.
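A quick way to see the savings is to build the same mostly-zero matrix in both formats and compare their memory footprints (the matrix size and density here are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix: 1,000 rows x 10,000 columns, ~0.1% non-zero
rng = np.random.default_rng(0)
dense = np.zeros((1000, 10_000))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 10_000, size=10_000)
dense[rows, cols] = 1.0

sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus their column indices and row offsets
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")
```

The dense array costs 8 bytes for every cell, zeros included; the CSR version pays only for the handful of non-zero entries.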

Working with multi-dimensional cosine similarity

import numpy as np

# Multiple vectors comparison
vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
norms = np.linalg.norm(vectors, axis=1)
normalized = vectors / norms[:, np.newaxis]
similarity_matrix = np.dot(normalized, normalized.T)
print("Similarity matrix:")
print(similarity_matrix.round(4))

Output:
Similarity matrix:
[[1.     0.9746 0.9594]
 [0.9746 1.     0.9982]
 [0.9594 0.9982 1.    ]]

You can efficiently compare multiple vectors at once using matrix operations—a method that’s much faster than looping through pairs. First, all vectors are normalized by dividing them by their respective magnitudes, which are calculated using np.linalg.norm with axis=1.

  • A single dot product between the normalized matrix and its transpose (normalized.T) produces a complete similarity matrix.
  • This matrix shows the similarity score for every vector pair, with the diagonal always being 1.0 since a vector is identical to itself.
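A common follow-up is finding the most similar pair in the matrix. Masking the diagonal first keeps the trivial self-matches from winning; a sketch building on the same three vectors:

```python
import numpy as np

vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
norms = np.linalg.norm(vectors, axis=1)
normalized = vectors / norms[:, np.newaxis]
similarity_matrix = normalized @ normalized.T

# Mask the diagonal so self-similarity (always 1.0) can't be the maximum
masked = similarity_matrix.copy()
np.fill_diagonal(masked, -np.inf)
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(f"Most similar pair: vectors {i} and {j} ({masked[i, j]:.4f})")
```

np.unravel_index converts the flat position returned by np.argmax back into (row, column) coordinates in the matrix.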

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the cosine similarity techniques you've explored, Replit Agent can turn them into production-ready tools:

  • Build a document similarity checker that scores the semantic closeness of two text files.
  • Create a basic recommendation engine that suggests similar items based on vector profiles.
  • Deploy a text analysis dashboard that visualizes document clusters based on their similarity scores.

Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically. Turn your concept into a working application with Replit Agent.

Common errors and challenges

Even with powerful libraries, you might run into issues like zero vectors, shape mismatches, or incorrect array dimensions.

  • Handling zero-magnitude vectors. A vector containing only zeros has a magnitude of zero, which puts a zero in the denominator of the cosine similarity formula. numpy evaluates this division to NaN (with a RuntimeWarning) rather than raising an exception. You can prevent this by checking for zero vectors and assigning a default similarity of 0.
  • Fixing shape mismatches. Cosine similarity requires vectors to have the same number of dimensions. If you try to compare vectors of different lengths, you'll likely get a ValueError. Ensure your data preprocessing pipeline creates vectors of a consistent size.
  • Using 2D arrays with sklearn. The sklearn.metrics.pairwise.cosine_similarity function is designed to work with collections of vectors, so it expects 2D array inputs. Passing a 1D list will cause an error; you must reshape single vectors into a 2D format, such as [[1, 2, 3]].

Handling zero magnitude vectors in cosine_similarity

A vector with all zero elements has a magnitude of zero. Because that magnitude sits in the denominator of the cosine similarity formula, the calculation divides by zero; numpy evaluates this to NaN (Not a Number) and emits a RuntimeWarning rather than raising an exception.

The following code demonstrates what happens when you attempt this calculation with a zero vector.

import numpy as np

# Vector with all zeros will cause division by zero
vector_a = np.array([0, 0, 0])
vector_b = np.array([4, 5, 6])

cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
print(f"Cosine similarity: {cosine_similarity}")

This code prints nan because the magnitude of vector_a is zero, making the denominator of the division zero. You can prevent this by adding a simple check, as the next example demonstrates.

import numpy as np

vector_a = np.array([0, 0, 0])
vector_b = np.array([4, 5, 6])

# Check for zero magnitude vectors
if np.linalg.norm(vector_a) == 0 or np.linalg.norm(vector_b) == 0:
    cosine_similarity = 0.0
else:
    cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))

print(f"Cosine similarity: {cosine_similarity}")

The fix is to check each vector's magnitude before dividing. The code uses an if statement to see if np.linalg.norm() returns 0 for either vector. If it does, the similarity is set to 0.0, which sidesteps the division-by-zero error entirely.

This check is especially important in text analysis, where empty documents or out-of-vocabulary words can easily produce zero vectors, leading to unexpected NaN results in your output.
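You can package that check into a small reusable helper (the function name safe_cosine_similarity and the choice of 0.0 as the default return value are illustrative):

```python
import numpy as np

def safe_cosine_similarity(a, b, default=0.0):
    """Cosine similarity that returns `default` when either vector has zero magnitude."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return default
    return float(np.dot(a, b) / norm_product)

print(safe_cosine_similarity([0, 0, 0], [4, 5, 6]))  # falls back to the default: 0.0
print(safe_cosine_similarity([1, 2, 3], [4, 5, 6]))  # normal case: ~0.9746
```

Centralizing the guard in one function means every call site in a pipeline gets the same well-defined behavior for empty documents or all-zero feature vectors.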

Fixing shape mismatches in vector comparisons

Cosine similarity requires vectors to have the same number of dimensions. If their lengths differ, the dot product operation will fail, causing a ValueError. This is a common issue when data isn't standardized. The code below shows this error in action.

import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6, 7]) # Different dimension

cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
print(f"Cosine similarity: {cosine_similarity}")

The np.dot call raises a ValueError because the vectors' shapes are incompatible: a dot product multiplies corresponding elements before summing, which requires both vectors to have the same length. The following example shows how to address this issue before making the comparison.

import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6, 7]) # Different dimension

# Pad the shorter vector (or truncate if needed)
max_length = max(len(vector_a), len(vector_b))
padded_a = np.pad(vector_a, (0, max_length - len(vector_a)), 'constant')
padded_b = np.pad(vector_b, (0, max_length - len(vector_b)), 'constant')

cosine_similarity = np.dot(padded_a, padded_b) / (np.linalg.norm(padded_a) * np.linalg.norm(padded_b))
print(f"Cosine similarity: {cosine_similarity}")

The fix involves making both vectors the same length before comparison. The code uses np.pad() to add zeros to the end of the shorter vector until it matches the length of the longer one. This process, known as padding, ensures both vectors have identical dimensions, which is a requirement for the dot product calculation. This issue often appears when your data preprocessing pipeline produces vectors of inconsistent sizes, so it's a good practice to standardize vector lengths early.

Correcting issues with 2D arrays in sklearn.metrics.pairwise.cosine_similarity

The sklearn.metrics.pairwise.cosine_similarity function is built to compare entire sets of vectors, which is why it requires 2D array inputs. Passing a simple 1D list is a frequent error that results in a ValueError. The following code demonstrates this issue.

from sklearn.metrics.pairwise import cosine_similarity

vector_a = [1, 2, 3] # 1D arrays
vector_b = [4, 5, 6]

similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine similarity: {similarity}")

The cosine_similarity function is designed for batch comparisons, so it expects a list of vectors. Passing a single 1D list like [1, 2, 3] causes a shape mismatch and a ValueError. Here’s how to correct the input format.

from sklearn.metrics.pairwise import cosine_similarity

vector_a = [[1, 2, 3]] # Convert to 2D arrays
vector_b = [[4, 5, 6]]

similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine similarity: {similarity[0][0]}")

The fix is to wrap each vector in an outer list, like [[1, 2, 3]], which converts it into the 2D array that cosine_similarity expects. The function is designed for batch operations, so it always returns a similarity matrix. For a simple two-vector comparison, you can access the result with similarity[0][0]. Keep an eye out for this whenever you’re using sklearn to compare just two vectors instead of entire datasets.
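If your vectors are already numpy arrays, reshape(1, -1) is an equivalent way to add the outer dimension without wrapping lists by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# reshape(1, -1) turns a 1D vector into a single-row 2D array
similarity = cosine_similarity(vector_a.reshape(1, -1), vector_b.reshape(1, -1))
print(f"Cosine similarity: {similarity[0][0]:.4f}")
```

The -1 tells numpy to infer the column count from the array's length, so the same call works for vectors of any dimension.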

Real-world applications

With a firm grasp on the calculations and error handling, you can now use cosine similarity to power real-world recommendation and search engines.

Building a movie recommendation system with cosine_similarity

You can find users with similar tastes by treating their movie ratings as vectors and calculating the cosine_similarity between them.

import numpy as np

# User ratings (rows=users, columns=movies)
ratings = np.array([
    [5, 4, 0, 1],  # User 1 (active user)
    [4, 5, 3, 0],  # User 2
    [1, 0, 5, 4]   # User 3
])

# Calculate similarity between User 1 and others
active_user = 0
similarities = []
for i in range(1, len(ratings)):
    dot_product = np.dot(ratings[active_user], ratings[i])
    norm_product = np.linalg.norm(ratings[active_user]) * np.linalg.norm(ratings[i])
    similarity = dot_product / norm_product
    similarities.append((i, similarity))

most_similar = max(similarities, key=lambda x: x[1])
print(f"Most similar user: User {most_similar[0] + 1}")
print(f"Similarity score: {most_similar[1]:.4f}")

This code finds which user has the most similar taste to the active user. The ratings array holds movie preferences, where each row represents a different person's ratings vector.

The script then compares the active user to everyone else by:

  • Calculating the dot product of their rating vectors using np.dot().
  • Finding the product of their vector magnitudes with np.linalg.norm().
  • Dividing these two results to get the final similarity score.

Finally, the max() function identifies the user with the highest score, revealing the closest match in taste.
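To turn that match into actual recommendations, one simple heuristic is to suggest the movies the most similar user (User 2 in this example) rated but the active user hasn't, assuming a rating of 0 means "unrated":

```python
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],  # User 1 (active user)
    [4, 5, 3, 0],  # User 2
    [1, 0, 5, 4]   # User 3
])
active = ratings[0]
neighbor = ratings[1]  # the most similar user found in the example above

# Recommend movies the neighbor rated (> 0) that the active user hasn't (== 0)
candidates = np.where((active == 0) & (neighbor > 0))[0]
print(f"Recommended movie indices: {candidates.tolist()}")
```

Real systems weight these candidates by the neighbor's rating and by the similarity score itself, but the boolean mask captures the core idea of user-based collaborative filtering.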

Document similarity search using TfidfVectorizer and cosine_similarity

Combining TfidfVectorizer with cosine_similarity lets you build a search system that finds and ranks documents based on how semantically similar they are to a query document.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is great for data science",
    "Machine learning requires data",
    "Python is widely used for machine learning",
    "Natural language processing with Python"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Find documents similar to the first one
query_idx = 0
similarities = cosine_similarity(tfidf_matrix[query_idx:query_idx+1], tfidf_matrix)[0]

# Print results (excluding self-comparison)
for i, score in enumerate(similarities):
    if i != query_idx:
        print(f"Document {i+1}: similarity score {score:.4f}")

This script ranks documents by their semantic similarity to a query. It starts by using TfidfVectorizer to transform the raw text into a numerical TF-IDF matrix, which is a way to represent word importance in each document.

  • The cosine_similarity function then compares the first document's vector—the query—against all document vectors in the matrix.
  • This single operation produces an array of similarity scores, which the final loop prints out for review, skipping the document's comparison to itself.
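To present these as proper search results, you can sort the scores in descending order with np.argsort before printing:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is great for data science",
    "Machine learning requires data",
    "Python is widely used for machine learning",
    "Natural language processing with Python"
]
tfidf = TfidfVectorizer().fit_transform(documents)
scores = cosine_similarity(tfidf[0:1], tfidf)[0]

# Rank all documents by descending score, skipping the query itself (index 0)
ranked = [i for i in np.argsort(scores)[::-1] if i != 0]
for i in ranked:
    print(f"Document {i + 1}: {scores[i]:.4f}")
```

np.argsort returns indices in ascending order of score, so reversing it with [::-1] yields a best-match-first ranking.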

Get started with Replit

Now, turn these concepts into a real tool. Describe what you want to build to Replit Agent, like "build a document similarity checker" or "create a basic movie recommender."

The agent writes the code, tests for errors, and deploys your app automatically. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
