How to calculate a distance matrix in Python

Learn how to calculate a distance matrix in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

Published on: Tue, Mar 10, 2026
Updated on: Fri, Mar 13, 2026
The Replit Team

A distance matrix quantifies the distances between all pairs of points in a dataset. In Python, you can compute this matrix for tasks in data analysis, machine learning, and logistics.

In this article, you'll learn several techniques to build a distance matrix. We'll cover practical tips, real-world applications, and debugging advice to help you implement the right solution for your project.

Basic distance matrix with NumPy

import numpy as np

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
distances = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
print(distances)

Output:

[[0.         3.         4.         1.41421356]
 [3.         0.         5.         2.23606798]
 [4.         5.         0.         3.16227766]
 [1.41421356 2.23606798 3.16227766 0.        ]]

This approach leverages NumPy's broadcasting to compute all distances at once, which is significantly faster than using loops. The magic happens in a single, dense line of code that builds on the Euclidean distance formula.

The core of the operation is points[:, None, :] - points[None, :, :]. By inserting None, you add new axes that cause NumPy to expand the arrays, effectively creating a matrix of differences between every pair of points. From there, the process is straightforward:

  • The differences are squared using the ** 2 operator.
  • .sum(axis=2) adds the squared x and y components together.
  • np.sqrt() calculates the final distance by taking the square root.
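To see the broadcasting at work, you can inspect the intermediate shapes and one entry yourself:

```python
import numpy as np

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])

# Inserting None adds axes: (4, 1, 2) minus (1, 4, 2) broadcasts to (4, 4, 2)
diffs = points[:, None, :] - points[None, :, :]
print(diffs.shape)  # (4, 4, 2)

# Summing the squared components over the last axis gives squared distances
sq = (diffs ** 2).sum(axis=2)
print(sq.shape)     # (4, 4)
print(sq[0, 3])     # (0-1)^2 + (0-1)^2 = 2
```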

Common libraries for distance calculations

Beyond the manual NumPy approach, libraries like SciPy and scikit-learn offer dedicated functions to streamline distance matrix calculations using various metrics.

Using scipy.spatial.distance functions

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
dist_condensed = pdist(points, metric='euclidean')
dist_matrix = squareform(dist_condensed)
print(dist_matrix)

Output:

[[0.         3.         4.         1.41421356]
 [3.         0.         5.         2.23606798]
 [4.         5.         0.         3.16227766]
 [1.41421356 2.23606798 3.16227766 0.        ]]

SciPy streamlines this task with two specialized functions. The pdist function first computes the distances between all pairs of points, but it returns them in a memory-efficient, one-dimensional array called a condensed distance matrix. This avoids storing duplicate information.

  • The pdist function calculates every unique distance once.
  • Then, squareform takes this compact array and converts it into the full, symmetric matrix you'd expect.

This approach is not only clean but also flexible, as pdist supports various distance metrics beyond just Euclidean.
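A quick way to confirm the relationship between the two forms: for n points, pdist returns n*(n-1)/2 unique values, and squareform converts in both directions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])

condensed = pdist(points)  # defaults to the Euclidean metric
# For n points, the condensed form holds n*(n-1)/2 unique distances
print(len(condensed))      # 6 for n = 4

# squareform expands to the full symmetric matrix...
full = squareform(condensed)
print(full.shape)          # (4, 4)

# ...and also converts back to the condensed form
print(np.allclose(squareform(full), condensed))  # True
```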

Using scikit-learn's pairwise_distances

import numpy as np
from sklearn.metrics import pairwise_distances

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
dist_matrix = pairwise_distances(points, metric='euclidean')
print(dist_matrix)

Output:

[[0.         3.         4.         1.41421356]
 [3.         0.         5.         2.23606798]
 [4.         5.         0.         3.16227766]
 [1.41421356 2.23606798 3.16227766 0.        ]]

Scikit-learn's pairwise_distances function offers a very direct path to a distance matrix. Unlike SciPy's two-step process, this function computes and returns the full, symmetric matrix in a single call. You just need to pass your points and specify the distance metric.

  • It’s a one-shot operation, which keeps your code clean and easy to read.
  • Similar to SciPy's pdist, it supports many distance metrics, so you can easily switch from 'euclidean' to others as needed.

Using custom distance metrics

import numpy as np
from sklearn.metrics import pairwise_distances

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
manhattan_dist = pairwise_distances(points, metric='manhattan')
cosine_dist = pairwise_distances(points, metric='cosine')  # computed the same way; output omitted for brevity
print("Manhattan distances:\n", manhattan_dist)

Output:

Manhattan distances:
 [[0. 3. 4. 2.]
 [3. 0. 7. 3.]
 [4. 7. 0. 4.]
 [2. 3. 4. 0.]]

The real power of functions like pairwise_distances is their flexibility. You aren't limited to just one type of distance. By simply changing the metric parameter, you can instruct the function to use different mathematical rules for its calculations.

  • metric='manhattan' calculates distance as if you were navigating a city grid, summing the absolute differences along each axis. It's often called "taxicab geometry."
  • metric='cosine' measures the angle between two points, which is useful for comparing orientation rather than magnitude, especially in text analysis.
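As a quick sanity check, you can reproduce the Manhattan metric by hand and compare it against SciPy's equivalent cityblock function:

```python
import numpy as np
from scipy.spatial.distance import cityblock

a = np.array([0, 0])
b = np.array([3, 4])

# Manhattan distance by hand: sum of absolute coordinate differences
manual = np.abs(a - b).sum()
print(manual)           # 7

# SciPy's cityblock metric computes the same quantity
print(cityblock(a, b))  # 7
```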

Advanced techniques and optimizations

When standard library functions aren't enough, you can turn to more advanced techniques for handling large datasets or specialized calculations like geographic distances.

Vectorized operations for large datasets

import numpy as np

def fast_euclidean_distmat(X):
    # Compute squared Euclidean distance matrix without loops:
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.dot(b)
    XX = np.sum(X * X, axis=1)[:, np.newaxis]
    distances = XX + XX.T - 2 * X.dot(X.T)
    return np.sqrt(np.maximum(distances, 0))  # prevent small negative values

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
print(fast_euclidean_distmat(points))

Output:

[[0.         3.         4.         1.41421356]
 [3.         0.         5.         2.23606798]
 [4.         5.         0.         3.16227766]
 [1.41421356 2.23606798 3.16227766 0.        ]]

For very large datasets, you can gain a performance edge by reframing the calculation. This function, fast_euclidean_distmat, uses a linear algebra identity to compute squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.dot(b). This method relies on highly optimized matrix multiplication, which can be faster than broadcasting.

  • It calculates the squared magnitude of each point vector.
  • It uses X.dot(X.T) to find the dot product between all pairs of points.
  • The np.maximum function is a safeguard against small floating-point errors before taking the final square root.
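One way to build confidence in the dot-product identity is to check it against SciPy's reference implementation on random data (a small sanity check, not a benchmark):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((200, 3))

# Dot-product formulation of the distance matrix
XX = np.sum(X * X, axis=1)[:, np.newaxis]
fast = np.sqrt(np.maximum(XX + XX.T - 2 * X.dot(X.T), 0))

# Reference result from SciPy
reference = squareform(pdist(X))

# Both formulations agree to floating-point tolerance
print(np.allclose(fast, reference))  # True
```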

Accelerating calculations with numba

import numpy as np
import numba

@numba.njit(parallel=True)
def numba_distmat(X):
    n = X.shape[0]
    dist_matrix = np.zeros((n, n))
    for i in numba.prange(n):
        for j in range(n):
            dist_matrix[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
    return dist_matrix

points = np.array([[0, 0], [3, 0], [0, 4], [1, 1]])
print(numba_distmat(points))

Output:

[[0.         3.         4.         1.41421356]
 [3.         0.         5.         2.23606798]
 [4.         5.         0.         3.16227766]
 [1.41421356 2.23606798 3.16227766 0.        ]]

Numba acts as a just-in-time (JIT) compiler that can dramatically speed up your code. By adding the @numba.njit decorator, you instruct it to convert a standard Python function—even one with slow loops—into highly optimized machine code. This approach shines when you need raw performance without abandoning Python's simplicity.

  • The parallel=True argument unlocks automatic parallelization.
  • Using numba.prange instead of a regular range loop allows Numba to safely distribute the work across multiple CPU cores, offering a massive performance boost for large datasets.

Calculating distance matrices for geographic coordinates

import numpy as np
from geopy.distance import geodesic

# Coordinates as (latitude, longitude) pairs
locations = [(40.7128, -74.0060),  # New York
            (34.0522, -118.2437), # Los Angeles
            (51.5074, -0.1278),   # London
            (35.6762, 139.6503)]  # Tokyo

n = len(locations)
geo_dist = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        geo_dist[i, j] = geodesic(locations[i], locations[j]).kilometers

print(geo_dist)

Output:

[[    0.          3935.9463336   5570.24146795 10838.00808896]
 [ 3935.9463336      0.          8786.03389706  9501.85378344]
 [ 5570.24146795  8786.03389706     0.          9560.12965015]
 [10838.00808896  9501.85378344  9560.12965015     0.        ]]

Calculating distances between geographic coordinates requires a different approach because the Earth's surface is curved, not flat. The geopy library handles this complexity by computing the shortest path along an ellipsoidal model of the Earth's surface.

  • The geodesic function is the key here. It takes two (latitude, longitude) tuples and accurately computes the distance between them.
  • The code iterates through each pair of locations, calls geodesic(...).kilometers to get the result, and populates the final distance matrix.
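If you'd rather avoid the geopy dependency, the haversine formula in plain NumPy gives a fully vectorized alternative. Note the assumptions: this sketch treats the Earth as a perfect sphere of radius 6371 km, so results differ slightly from geopy's more accurate ellipsoidal geodesic model.

```python
import numpy as np

def haversine_matrix(coords, radius_km=6371.0):
    """Great-circle distance matrix for (lat, lon) pairs in degrees.

    Spherical approximation -- slightly less accurate than geopy's
    ellipsoidal geodesic model, but needs only NumPy.
    """
    lat, lon = np.radians(np.asarray(coords)).T
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

locations = [(40.7128, -74.0060),   # New York
             (34.0522, -118.2437)]  # Los Angeles
print(haversine_matrix(locations))  # NY-LA comes out near 3936 km
```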

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the distance calculation techniques we've explored, Replit Agent can turn them into production tools:

  • Build a logistics calculator that generates a distance matrix for multiple delivery locations using their geographic coordinates.
  • Create a document similarity checker that uses cosine distance to find related text files in a dataset.
  • Deploy a data analysis utility that computes a distance matrix from user-uploaded data points using Euclidean or Manhattan metrics.

Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.

Common errors and challenges

When calculating distance matrices, you might run into issues like negative values or memory errors, but these challenges are manageable with the right approach.

Fixing negative values in distance calculations with np.maximum

You may occasionally find negative numbers in your squared distance matrix before taking the final square root. This isn't a bug in your logic; it's a quirk of floating-point arithmetic where tiny precision errors can make a zero result slightly negative.

  • Attempting to take the square root of a negative number produces a NaN (Not a Number) value, which can corrupt your entire matrix.
  • To prevent this, you can use NumPy's np.maximum() function. By wrapping your squared distance calculation with np.maximum(your_distances, 0), you effectively clamp all negative values to zero, ensuring a clean final result.

Resolving memory errors with large datasets using chunking

If you're working with a very large number of points, trying to compute the full distance matrix at once can trigger a MemoryError. The resulting matrix may simply be too big to fit into your computer's RAM.

The solution is to process your data in chunks. Instead of one massive calculation, you can loop through smaller batches of points, compute their distances against the full dataset, and then combine the results. This approach trades a little computational time for the ability to work with datasets of virtually any size.
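A minimal sketch of that batching idea, using SciPy's cdist to compare one chunk of rows at a time against the full dataset (the chunk size of 250 here is illustrative; in practice you would process or store each block rather than just printing its shape):

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.random.rand(1000, 2)
chunk_size = 250

# Process the rows in batches instead of building the full matrix at once
for start in range(0, len(points), chunk_size):
    chunk = points[start:start + chunk_size]
    # Distances from this chunk to every point in the dataset
    block = cdist(chunk, points)
    # Use or store `block` here (e.g., nearest neighbors, write to disk)
    print(block.shape)  # (250, 1000)
```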

Handling NaN values in distance matrix calculations

Missing data, represented as NaN values in your input array, will also cause problems. Any distance calculation involving a NaN will return NaN, spreading missing values throughout your final matrix.

You have a few options for handling this before you compute the matrix:

  • Removal: The most direct method is to filter out any rows that contain NaN values. This works well if you have plenty of data to spare.
  • Imputation: You can replace the NaNs with a calculated value, such as the mean or median of the respective column. This preserves your data points.
  • Use a NaN-aware metric: Some functions, like scikit-learn's pairwise_distances, offer metrics such as 'nan_euclidean' that are specifically designed to handle missing values during the calculation.

Fixing negative values in distance calculations with np.maximum

Floating-point math isn't always perfect. Tiny precision errors can cause squared distances that should be zero to become slightly negative. This leads to NaN values when you take the square root. The code below shows this problem in action.

import numpy as np

# Create a case that produces numerical errors
points = np.array([[0, 0], [1e-10, 1e-10], [3, 4]])
XX = np.sum(points**2, axis=1)[:, np.newaxis]
distances_squared = XX + XX.T - 2 * points.dot(points.T)
# This can produce small negative values due to floating-point errors
distance_matrix = np.sqrt(distances_squared)
print(distance_matrix)

The formula XX + XX.T - 2 * points.dot(points.T) can produce tiny negative values instead of zero due to precision loss. Applying np.sqrt() to these negatives causes NaN errors. The corrected implementation below shows how to prevent this.

import numpy as np

points = np.array([[0, 0], [1e-10, 1e-10], [3, 4]])
XX = np.sum(points**2, axis=1)[:, np.newaxis]
distances_squared = XX + XX.T - 2 * points.dot(points.T)
# Use np.maximum to ensure non-negative values
distance_matrix = np.sqrt(np.maximum(distances_squared, 0))
print(distance_matrix)

The solution is to use np.maximum() to safeguard your calculation. By wrapping the squared distances in np.maximum(distances_squared, 0), you replace any small negative values with zero before taking the square root, preventing NaN values from appearing in your final matrix. Watch for this issue whenever you use vectorized formulas, where floating-point cancellation is common, especially with points that are very close together.

Resolving memory errors with large datasets using chunking

Calculating a distance matrix for a large dataset can quickly exhaust your system's memory. A 10,000-point dataset requires a 10,000 x 10,000 matrix, which can consume gigabytes of RAM and trigger a MemoryError. The code below demonstrates this problem.

import numpy as np

# Generate a large dataset
n_samples = 10000
points = np.random.rand(n_samples, 2)

# This can cause memory errors for large datasets
distances = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
print(f"Full distance matrix shape: {distances.shape}")

The broadcasting operation points[:, None, :] - points[None, :, :] tries to build a massive intermediate array with all point differences simultaneously. This single step is what can overwhelm your system's memory. See how to avoid this below.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# Generate a large dataset
n_samples = 10000
points = np.random.rand(n_samples, 2)

# Using chunking to process submatrices
chunks = pairwise_distances_chunked(points, metric='euclidean', working_memory=100)
for chunk in chunks:
    print(f"Processing chunk of shape: {chunk.shape}")
    break  # Just showing the first chunk

The solution is to process the data in batches. Scikit-learn's pairwise_distances_chunked function breaks the calculation into smaller, manageable chunks. Instead of creating one giant matrix, it computes sub-matrices that fit within a specified memory limit, which you can control with the working_memory parameter. This approach lets you handle massive datasets without triggering a MemoryError. You'll want to use this technique whenever a full distance matrix would exceed your available RAM.

Handling NaN values in distance matrix calculations

When your input data contains missing values, represented as NaN, they can corrupt your entire distance matrix. Any mathematical operation with a NaN produces another NaN, causing a chain reaction that renders your results useless. The following code shows this problem in action.

import numpy as np

# Dataset with missing values
points = np.array([[0, 0], [3, np.nan], [0, 4], [1, 1]])

# This results in NaN values in the distance matrix
distances = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
print(distances)

Any subtraction involving np.nan produces another NaN, which contaminates the entire row and column associated with that point and spreads missing values throughout the final matrix. The corrected implementation below shows how to address this.

import numpy as np
from sklearn.impute import SimpleImputer

# Dataset with missing values
points = np.array([[0, 0], [3, np.nan], [0, 4], [1, 1]])

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
points_imputed = imputer.fit_transform(points)
distances = np.sqrt(((points_imputed[:, None, :] - points_imputed[None, :, :]) ** 2).sum(axis=2))
print(distances)

The solution is to handle NaN values before calculating distances. This process, called imputation, cleans the data so calculations can proceed without errors.

  • Use scikit-learn's SimpleImputer to replace missing data.
  • The strategy='mean' option replaces each NaN with the average value of its column.

You'll want to check for NaNs whenever your data comes from external sources like files or APIs, as they often contain missing entries.
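If you'd rather not modify the data at all, scikit-learn's NaN-aware metric is a one-line alternative to imputation: it ignores missing coordinates and rescales the remaining ones.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

points = np.array([[0, 0], [3, np.nan], [0, 4], [1, 1]])

# 'nan_euclidean' skips missing coordinates and rescales the rest,
# so no imputation step is needed
distances = pairwise_distances(points, metric='nan_euclidean')
print(distances)
print(np.isnan(distances).any())  # False
```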

Real-world applications

These calculations are the engine behind powerful applications, from clustering data with KMeans to analyzing similarities between documents.

Clustering data with KMeans using distance calculations

At its core, the KMeans algorithm relies on distance calculations to partition data into a specified number of clusters, assigning each point to the nearest cluster center.

import numpy as np
from sklearn.cluster import KMeans

# Sample 2D data points
points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(points)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)

This code uses scikit-learn's KMeans to automatically sort the data points into two distinct groups, as defined by n_clusters=2. The fit method runs the algorithm and organizes the data. Once complete, you can access the results:

  • The kmeans.labels_ attribute is an array that tells you which cluster—either 0 or 1—each point belongs to.
  • The kmeans.cluster_centers_ attribute provides the coordinates for the calculated center of each of those two groups.
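Once fitted, the same nearest-center distance logic extends to unseen data via predict. A small sketch (the two test points are hypothetical examples, not from the original data):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

# predict() assigns each new point to the nearest learned center --
# the same distance computation the training step relies on
new_points = np.array([[0, 0], [10, 10]])
labels = kmeans.predict(new_points)
print(labels)

# The two test points sit near opposite clusters
print(labels[0] != labels[1])  # True
```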

Document similarity analysis with distance matrices

You can also use distance calculations to measure how similar documents are by first converting the text into numerical vectors.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
   "Distance matrices are useful in many applications",
   "NumPy provides efficient tools for mathematical operations",
   "Matrices help calculate distances between points",
   "Python is a popular programming language for data science"
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)
similarity_matrix = cosine_similarity(tfidf_matrix)
print(np.round(similarity_matrix, 2))

This code gauges how similar documents are by first converting their text into numerical data. It uses a two-step process to compare the content of each sentence.

  • First, TfidfVectorizer turns each document into a vector. This vector represents the importance of each word in the document relative to the entire collection.
  • Next, cosine_similarity calculates the similarity between these vectors. The resulting matrix shows a score for every pair of documents, indicating how closely related they are.
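If you need distances rather than similarities, for example to feed a clustering algorithm, cosine distance is simply one minus cosine similarity. A small sketch with two documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Distance matrices are useful in many applications",
    "Matrices help calculate distances between points",
]

tfidf = TfidfVectorizer().fit_transform(documents)
similarity = cosine_similarity(tfidf)

# Cosine distance is 1 minus cosine similarity
distance = 1 - similarity
print(np.round(distance, 2))

# Each document is at distance 0 from itself
print(np.allclose(np.diag(distance), 0, atol=1e-6))  # True
```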

Get started with Replit

Turn what you've learned into a real tool with Replit Agent. Try a prompt like "build a geographic distance calculator for a list of cities" or "create an app that finds similar documents using cosine distance."

The agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
