How to remove missing values in Python
Learn to remove missing values in Python. This guide covers different methods, tips, real-world applications, and debugging common errors.

Missing values are a common data analysis challenge that can skew results. Python provides robust tools to manage these gaps, ensure data integrity, and improve the accuracy of your models.
In this article, you'll explore techniques to remove missing data. You'll get practical tips, see real-world applications, and receive debugging advice to select the best method for your project.
Using dropna() to remove missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
cleaned_df = df.dropna()
print(df, "\n\nAfter dropna():\n", cleaned_df)

Output:
A B
0 1.0 5.0
1 2.0 NaN
2 NaN NaN
3 4.0 8.0

After dropna():
A B
0 1.0 5.0
3 4.0 8.0
The dropna() function is a direct way to eliminate rows or columns with missing data. In the example, the original DataFrame df has NaN values in the second and third rows. By default, dropna() removes any row containing at least one missing value.
As a result, the cleaned_df DataFrame only retains the first and last rows, which were the only ones without any NaN values. This method is effective for quick data cleaning, but be mindful that it can discard a significant amount of information if your dataset has many gaps.
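If the missing values cluster in particular columns rather than rows, you can drop columns instead by passing axis=1. A minimal sketch, using a new illustrative DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# axis=1 drops any column containing at least one NaN,
# so only the fully populated column 'C' survives
cols_cleaned = df.dropna(axis=1)
print(cols_cleaned)
```

This keeps every row intact at the cost of losing whole features, so it suits datasets where a few columns account for most of the gaps.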
Basic techniques for handling missing values
Beyond the all-or-nothing approach of dropna(), you have more granular methods for filtering or replacing specific missing values throughout your dataset.
Filtering NaN values with NumPy
import numpy as np
arr = np.array([1, 2, np.nan, 4, 5, np.nan])
filtered_arr = arr[~np.isnan(arr)]
print("Original array:", arr)
print("Filtered array:", filtered_arr)

Output:
Original array: [ 1.  2. nan  4.  5. nan]
Filtered array: [1. 2. 4. 5.]
When working with NumPy arrays, you can filter out NaN values using boolean indexing. The np.isnan() function is the key—it generates a boolean mask, marking True for every NaN value in your array.
- By pairing it with the bitwise NOT operator (~), you invert this mask to select only the valid numbers.
- Applying this inverted mask to the original array, arr, creates a new array, filtered_arr, that excludes all NaN values.
This method provides a concise way to clean numerical data directly within NumPy before further analysis.
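The same masking idea extends to 2-D arrays. A minimal sketch, assuming you want to keep only the rows that contain no NaN at all:

```python
import numpy as np

# 2-D array where the middle row contains a NaN
arr2d = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, 5.0]])

# any(axis=1) flags rows with at least one NaN;
# ~ inverts the flags so only fully valid rows remain
valid_rows = arr2d[~np.isnan(arr2d).any(axis=1)]
print(valid_rows)
```

Swapping any() for all() would instead drop only rows that are entirely NaN, mirroring the how='all' behavior discussed later for Pandas.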
Using boolean indexing to filter missing values
import pandas as pd
import numpy as np
series = pd.Series([1, 2, np.nan, 4, np.nan])
filtered_series = series[series.notna()]
print("Original series:\n", series)
print("\nFiltered series:\n", filtered_series)

Output:
Original series:
0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
dtype: float64
Filtered series:
0 1.0
1 2.0
3 4.0
dtype: float64
For Pandas Series, you can use boolean indexing to filter out missing values. The notna() method is a powerful tool for this. It checks each element in the series and returns a new Series of boolean values.
- True indicates the value is not missing.
- False indicates the value is NaN.
When you use this boolean Series as an index, Pandas only keeps the elements where the boolean value is True. This creates a new, cleaned Series without the NaN values.
Replacing missing values with the fillna() method
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
filled_df = df.fillna(0)
print("Original dataframe:\n", df)
print("\nDataframe with filled values:\n", filled_df)

Output:
Original dataframe:
A B
0 1.0 5.0
1 2.0 NaN
2 NaN NaN
3 4.0 8.0
Dataframe with filled values:
A B
0 1.0 5.0
1 2.0 0.0
2 0.0 0.0
3 4.0 8.0
Instead of removing data, you can replace missing values using the fillna() method. This approach is useful when you want to preserve your dataset's size. In the example, df.fillna(0) finds every NaN value in the DataFrame and replaces it with 0.
- The method returns a new DataFrame, filled_df, with the gaps filled in, leaving the original untouched.
- While this example uses a static value like 0, you can also use more complex strategies, such as filling with a column's mean or median.
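As a quick sketch of that idea: passing df.median(), a per-column Series, to fillna() fills each column with its own median. The DataFrame here mirrors the earlier example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})

# df.median() is a Series of per-column medians
# (A: median of [1, 2, 4] = 2.0; B: median of [5, 8] = 6.5);
# fillna() matches it to columns by label
median_filled = df.fillna(df.median())
print(median_filled)
```

The median is often preferred over the mean when a column contains outliers that would skew the average.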
Advanced techniques for handling missing values
When filling gaps with a single value isn't precise enough, you can employ more dynamic methods to preserve the underlying patterns in your data.
Using custom functions with apply() to handle missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df_custom = df.apply(lambda col: col.fillna(col.mean()))
print(df_custom)

Output:
A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000
The apply() method offers a flexible way to handle missing data by running a function across each column. Here, a lambda function is used to dynamically fill NaN values. This approach is more sophisticated than using a single static value for the entire DataFrame.
- For each column, the function first calculates its mean using col.mean().
- It then uses fillna() to replace any missing values in that column with the calculated mean.
Interpolating missing values
import pandas as pd
import numpy as np
series = pd.Series([1, np.nan, np.nan, 4, 5])
interp_series = series.interpolate(method='linear')
print("Original:\n", series)
print("\nInterpolated:\n", interp_series)

Output:
Original:
0 1.0
1 NaN
2 NaN
3 4.0
4 5.0
dtype: float64
Interpolated:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Interpolation is a smart way to fill missing data by estimating values based on the data points around them. The Pandas interpolate() method is particularly useful for ordered data, like a time series, where values follow a logical sequence.
- By setting method='linear', you instruct Pandas to treat the values as equally spaced.
- It then fills the gaps by drawing a straight line between the known points. In this case, it calculates the values between 1 and 4.
- The two NaNs are replaced with 2 and 3, completing the arithmetic progression.
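One caveat worth knowing: by default, linear interpolation only fills gaps that sit between known points, so a leading NaN stays in place. The limit_direction parameter controls this. A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Default interpolation works forward only,
# so the leading NaN at index 0 is left untouched
forward_only = s.interpolate()

# limit_direction='both' also back-fills the leading gap
# with the nearest known value
both_ways = s.interpolate(limit_direction='both')
print(both_ways)
```

Check the ends of your series after interpolating; a surviving NaN at either edge usually means you need limit_direction='both' or a separate fill step.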
Using scikit-learn's SimpleImputer for missing values
import numpy as np
from sklearn.impute import SimpleImputer
import pandas as pd
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 5]])
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)
print(pd.DataFrame(imputed_data, columns=['A', 'B']))

Output:
A B
0 1.0 2.0
1 4.0 3.0
2 7.0 6.0
3 4.0 5.0
For machine learning tasks, scikit-learn’s SimpleImputer offers a robust way to handle missing data. It’s a preprocessing step that learns a strategy from your data and uses it to fill in the gaps.
- You first create an imputer instance, setting the strategy to 'mean' to calculate each column's average.
- The fit_transform() method then learns this mean from the non-missing values and replaces any NaNs with it.
- In the example, it replaces the missing values in the first column with 4.0, which is the average of 1 and 7.
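The strategy parameter accepts other options as well. A brief sketch using 'median', which holds up better when a column contains an outlier (the data array here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column A contains an outlier (100.0) that would distort the mean
data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [7.0, 6.0],
                 [100.0, 5.0]])

# The median of [1, 7, 100] is 7.0, which replaces the NaN;
# the mean would have been 36.0
median_imputer = SimpleImputer(strategy='median')
median_filled = median_imputer.fit_transform(data)
print(median_filled)
```

SimpleImputer also supports 'most_frequent' and 'constant' strategies, which extend the same pattern to categorical columns.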
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the data cleaning techniques from this article, Replit Agent can turn them into production applications:
- Build a data cleaning utility that processes uploaded datasets, letting you remove incomplete records with dropna() or replace gaps using fillna().
- Create a time-series dashboard that uses interpolate() to estimate and visualize missing sensor readings or financial data.
- Deploy a machine learning preprocessor that automatically handles missing features in a training dataset with scikit-learn's SimpleImputer.
Get started by describing your application idea. Replit Agent will write the code, test it, and fix issues for you automatically.
Common errors and challenges
Navigating missing data can introduce subtle bugs, but understanding these common challenges will help you avoid them.
Avoiding mistakes when comparing with np.nan values
A frequent mistake is trying to find missing values using a direct comparison like df['column'] == np.nan. This won't work because, by design, np.nan is not considered equal to anything, not even itself. This behavior ensures that missing values don't accidentally match in comparisons.
To correctly identify NaN values, you should always use functions built for this purpose. In Pandas, the isna() and notna() methods are your best tools. For NumPy arrays, use the np.isnan() function to create a reliable boolean mask.
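Before choosing a removal or fill method, it helps to know how much data is actually missing. Summing the boolean mask from isna() gives a per-column count. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})

# isna() returns a boolean mask; summing it counts
# True values (i.e. missing entries) per column
missing_counts = df.isna().sum()
print(missing_counts)
```

A column with only a handful of gaps is a good candidate for dropna() or fillna(), while one that is mostly missing may be better dropped entirely.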
Handling unexpected results with dropna() parameters
Sometimes, running dropna() can remove far more data than you intended, especially if missing values are scattered across many rows. The default behavior removes any row containing even a single NaN. You can control this with its parameters.
- Use how='all' to only drop rows or columns where every single value is missing.
- Use the thresh parameter to specify the minimum number of non-missing values a row or column must have to be kept. For example, thresh=3 keeps rows that have at least three valid data points.
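A short sketch of the thresh parameter in action, using an illustrative DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, np.nan],
    'B': [4, 5, np.nan],
    'C': [7, 8, np.nan],
})

# thresh=2 keeps rows with at least two non-missing values:
# row 0 has 3, row 1 has 2, row 2 has 0 (dropped)
kept = df.dropna(thresh=2)
print(kept)
```

Note that thresh counts the values that are present, not the values that are missing, which is a common point of confusion.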
Debugging type errors after filling missing values
After you use fillna(), you might run into a TypeError later in your analysis. This often happens because filling missing values can change a column's data type, or dtype. For instance, a column of integers with a single NaN is stored as a float type; filling it with 0 will still leave it as a float column.
If you fill with a non-numeric value like the string 'missing', the entire column's dtype will change to object, which will cause errors with mathematical operations. Always check your column types with df.dtypes after filling data. If needed, you can convert a column back to its correct type using the astype() method, like df['column'].astype('int64').
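A minimal sketch of checking and restoring a dtype after filling:

```python
import pandas as pd
import numpy as np

# The NaN forces the Series to be stored as float64
s = pd.Series([1, 2, np.nan, 4])

# Filling with 0 removes the NaN but the dtype stays float64
filled = s.fillna(0)

# Once no NaN remains, astype() can restore the integer type
restored = filled.astype('int64')
print(restored)
```

If you need to keep missing values in an integer column instead of filling them, Pandas' nullable 'Int64' dtype (capital I) is designed for exactly that case.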
Avoiding mistakes when comparing with np.nan values
You can't use the standard equality operator == to find np.nan values. This is because np.nan has a special property where it never equals itself. Attempting this comparison will return an empty result, as the following code demonstrates.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4]})
# Incorrect way to filter rows with NaN
filtered_df = df[df['A'] == np.nan]
print("Rows with NaN:", filtered_df)
Because np.nan isn't equal to anything, even itself, the expression df['A'] == np.nan returns False for every row. This leaves you with an empty DataFrame. The following code demonstrates the correct method for this check.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4]})
# Correct way to filter rows with NaN
filtered_df = df[df['A'].isna()]
print("Rows with NaN:", filtered_df)
The correct way to find missing values is with the isna() method. This function scans a Series and returns a boolean mask—True for every NaN and False otherwise. Using this mask as an index, as in df[df['A'].isna()], reliably selects only the rows with missing data. Always use this approach instead of direct comparison whenever you need to filter, count, or otherwise handle NaN values in your datasets to avoid empty results.
Handling unexpected results with dropna() parameters
The dropna() function can sometimes produce surprising results, leaving missing values in your dataset when you thought they'd be gone. This often happens when its parameters, like how, aren't configured for your specific goal. The code below shows this in action.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]
})
# This drops only rows where ALL values are NaN
cleaned_df = df.dropna(how='all')
print("After dropna(how='all'):\n", cleaned_df)
By setting how='all', you're telling dropna() to only remove rows where every value is missing. Since no row is completely empty, the function doesn't drop anything. The following code demonstrates the correct approach.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]
})
# This drops rows where ANY value is NaN
cleaned_df = df.dropna(how='any')
print("After dropna(how='any'):\n", cleaned_df)
The fix is to set the how parameter to 'any'. This instructs dropna() to remove a row if it contains even a single NaN value. It's the default setting and the most common use case for cleaning datasets. This approach is perfect when you need to work only with fully complete records. However, be careful, as it can aggressively remove data if missing values are scattered across many rows.
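When only certain columns matter for your analysis, the subset parameter restricts the check so rows aren't dropped over irrelevant gaps. A minimal sketch, reusing the same DataFrame shape:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
})

# Only rows where column 'A' is missing are dropped;
# NaNs in 'B' are ignored, so rows 0, 1, and 3 survive
cleaned = df.dropna(subset=['A'])
print(cleaned)
```

This is useful when one column is required for a calculation but others are optional.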
Debugging type errors after filling missing values
Filling missing values can sometimes lead to unexpected TypeError exceptions later in your code. This often happens because methods like fillna() can silently change a column's data type, or dtype, creating a mismatch with the operations you intend to perform.
For example, replacing a missing string with a number can convert the entire column, causing string methods to fail. The following code demonstrates how this can happen.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': ['A001', 'A002', 'A003', np.nan],
'value': [100, 200, 300, 400]
})
df['id'] = df['id'].fillna(0)
# Will cause error in string operations
result = df['id'].str.upper()
print(result)
The id column, which contains strings, is filled with the integer 0. When you apply a string method like str.upper(), it fails because the column now contains a number. The following code demonstrates the correct approach.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': ['A001', 'A002', 'A003', np.nan],
'value': [100, 200, 300, 400]
})
df['id'] = df['id'].fillna('UNKNOWN')
# Now string operations work properly
result = df['id'].str.upper()
print(result)
The fix is to fill the NaN with a value that matches the column's data type. Instead of using an integer like 0, the correct code uses the string 'UNKNOWN'. This keeps the id column's type consistent, allowing methods like str.upper() to run without a TypeError.
Always check your column's dtype after using fillna() to ensure it hasn't been unintentionally changed, especially before you perform type-specific operations.
Real-world applications
With these techniques and debugging strategies, you can confidently tackle real-world data challenges like cleaning sales reports and analyzing survey data.
Cleaning a sales dataset for reporting
To ensure your sales reports are accurate, you'll need to clean the raw data by using the fillna() method to handle missing values that would otherwise prevent calculations.
import pandas as pd
import numpy as np
# Sample sales dataset with missing values
sales = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=5),
'product': ['A', 'B', np.nan, 'A', 'C'],
'quantity': [10, np.nan, 15, 20, np.nan],
'price': [100, 200, np.nan, 100, 150]
})
# Clean the dataset for reporting
clean_sales = sales.copy()
clean_sales['product'] = clean_sales['product'].fillna('Unknown')
clean_sales[['quantity', 'price']] = clean_sales[['quantity', 'price']].fillna(0)
clean_sales['total'] = clean_sales['quantity'] * clean_sales['price']
print("Original sales data:\n", sales)
print("\nCleaned sales data for reporting:\n", clean_sales)
This code demonstrates a practical approach to cleaning a sales dataset. It first makes a copy of the sales DataFrame to keep the original data intact. Different cleaning strategies are then applied to different columns based on their data type.
- The categorical product column has its missing values replaced with the string 'Unknown'.
- The numerical quantity and price columns are filled with 0 using fillna().
After cleaning, a new total column is successfully calculated by multiplying the now-complete quantity and price columns.
Using multiple imputation techniques for survey data
Multiple imputation is particularly effective for survey data, as it fills gaps by modeling the relationships between variables to produce more accurate and realistic estimates.
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create a sample survey dataset with missing values
np.random.seed(42)
survey = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50, np.nan, 35],
'income': [50000, np.nan, 75000, 60000, np.nan, 80000, 65000],
'satisfaction': [7, 8, 6, np.nan, 9, 7, np.nan],
'years_customer': [2, 5, np.nan, 8, 10, 4, 6]
})
print("Original survey data with missing values:")
print(survey)
# Use MICE (Multiple Imputation by Chained Equations)
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
survey_mice = pd.DataFrame(
mice_imputer.fit_transform(survey),
columns=survey.columns
)
print("\nAfter MICE imputation:")
print(survey_mice.round(1))
This example uses scikit-learn's IterativeImputer to intelligently fill missing survey data. It works by treating each column with missing values as a prediction problem, using the other columns as features to estimate the missing entries.
- The fit_transform() method applies this modeling process to the entire dataset.
- It cycles through the columns multiple times, controlled by max_iter, to refine its estimates.
This creates a complete dataset where the filled values are contextually aware, based on patterns in the existing data.
Get started with Replit
Turn these techniques into a real tool. Tell Replit Agent: “Build a utility to clean CSVs using fillna()” or “Create a dashboard that uses interpolate() to fix time-series data.”
The agent writes the code, tests for errors, and deploys your application automatically. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.