How to remove missing values in Python
Learn to remove missing values in Python. This guide covers different methods, tips, real-world applications, and debugging common errors.

Missing values are a common data analysis challenge that can skew results. Python provides robust tools to manage these gaps, ensure data integrity, and improve the accuracy of your models.
In this article, you'll explore techniques to remove missing data. You'll get practical tips, see real-world applications, and receive debugging advice to select the best method for your project.
Using dropna() to remove missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
cleaned_df = df.dropna()
print(df, "\n\nAfter dropna():\n", cleaned_df)

Output:
A B
0 1.0 5.0
1 2.0 NaN
2 NaN NaN
3 4.0 8.0

After dropna():
A B
0 1.0 5.0
3 4.0 8.0
The dropna() function is a direct way to eliminate rows or columns with missing data. In the example, the original DataFrame df has NaN values in the second and third rows. By default, dropna() removes any row containing at least one missing value.
As a result, the cleaned_df DataFrame only retains the first and last rows, which were the only ones without any NaN values. This method is effective for quick data cleaning, but be mindful that it can discard a significant amount of information if your dataset has many gaps.
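If the missing values cluster in particular columns rather than rows, you can drop columns instead by passing axis=1. A minimal sketch, using a new illustrative DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# axis=1 drops any column containing at least one NaN,
# so only the fully populated column 'C' survives
cols_cleaned = df.dropna(axis=1)
print(cols_cleaned)
```

This keeps every row intact at the cost of losing whole features, so it suits datasets where a few columns account for most of the gaps.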
Basic techniques for handling missing values
Beyond the all-or-nothing approach of dropna(), you have more granular methods for filtering or replacing specific missing values throughout your dataset.
Filtering NaN values with NumPy
import numpy as np
arr = np.array([1, 2, np.nan, 4, 5, np.nan])
filtered_arr = arr[~np.isnan(arr)]
print("Original array:", arr)
print("Filtered array:", filtered_arr)

Output:
Original array: [ 1.  2. nan  4.  5. nan]
Filtered array: [1. 2. 4. 5.]
When working with NumPy arrays, you can filter out NaN values using boolean indexing. The np.isnan() function is the key—it generates a boolean mask, marking True for every NaN value in your array.
- By pairing it with the bitwise NOT operator (~), you invert this mask to select only the valid numbers.
- Applying this inverted mask to the original array, arr, creates a new array, filtered_arr, that excludes all NaN values.
This method provides a concise way to clean numerical data directly within NumPy before further analysis.
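The same masking idea extends to 2-D arrays. A minimal sketch, assuming you want to keep only the rows that contain no NaN at all:

```python
import numpy as np

# 2-D array where the middle row contains a NaN
arr2d = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, 5.0]])

# any(axis=1) flags rows with at least one NaN;
# ~ inverts the flags so only fully valid rows remain
valid_rows = arr2d[~np.isnan(arr2d).any(axis=1)]
print(valid_rows)
```

Swapping any() for all() would instead drop only rows that are entirely NaN, mirroring the how='all' behavior discussed later for Pandas.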
Using boolean indexing to filter missing values
import pandas as pd
import numpy as np
series = pd.Series([1, 2, np.nan, 4, np.nan])
filtered_series = series[series.notna()]
print("Original series:\n", series)
print("\nFiltered series:\n", filtered_series)

Output:
Original series:
0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
dtype: float64
Filtered series:
0 1.0
1 2.0
3 4.0
dtype: float64
For Pandas Series, you can use boolean indexing to filter out missing values. The notna() method is a powerful tool for this. It checks each element in the series and returns a new Series of boolean values.
- True indicates the value is not missing.
- False indicates the value is NaN.
When you use this boolean Series as an index, Pandas only keeps the elements where the boolean value is True. This creates a new, cleaned Series without the NaN values.
Replacing missing values with the fillna() method
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
filled_df = df.fillna(0)
print("Original dataframe:\n", df)
print("\nDataframe with filled values:\n", filled_df)

Output:
Original dataframe:
A B
0 1.0 5.0
1 2.0 NaN
2 NaN NaN
3 4.0 8.0
Dataframe with filled values:
A B
0 1.0 5.0
1 2.0 0.0
2 0.0 0.0
3 4.0 8.0
Instead of removing data, you can replace missing values using the fillna() method. This approach is useful when you want to preserve your dataset's size. In the example, df.fillna(0) finds every NaN value in the DataFrame and replaces it with 0.
- The method returns a new DataFrame, filled_df, with the gaps filled in, leaving the original untouched.
- While this example uses a static value like 0, you can also use more complex strategies, such as filling with a column's mean or median.
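As a quick sketch of that idea: passing df.median(), a per-column Series, to fillna() fills each column with its own median. The DataFrame here mirrors the earlier example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})

# df.median() is a Series of per-column medians
# (A: median of [1, 2, 4] = 2.0; B: median of [5, 8] = 6.5);
# fillna() matches it to columns by label
median_filled = df.fillna(df.median())
print(median_filled)
```

The median is often preferred over the mean when a column contains outliers that would skew the average.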
Advanced techniques for handling missing values
When filling gaps with a single value isn't precise enough, you can employ more dynamic methods to preserve the underlying patterns in your data.
Using custom functions with apply() to handle missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})
df_custom = df.apply(lambda col: col.fillna(col.mean()))
print(df_custom)

Output:
A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000
The apply() method offers a flexible way to handle missing data by running a function across each column. Here, a lambda function is used to dynamically fill NaN values. This approach is more sophisticated than using a single static value for the entire DataFrame.
- For each column, the function first calculates its mean using col.mean().
- It then uses fillna() to replace any missing values in that column with the calculated mean.
Interpolating missing values
import pandas as pd
import numpy as np
series = pd.Series([1, np.nan, np.nan, 4, 5])
interp_series = series.interpolate(method='linear')
print("Original:\n", series)
print("\nInterpolated:\n", interp_series)

Output:
Original:
0 1.0
1 NaN
2 NaN
3 4.0
4 5.0
dtype: float64
Interpolated:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Interpolation is a smart way to fill missing data by estimating values based on the data points around them. The Pandas interpolate() method is particularly useful for ordered data, like a time series, where values follow a logical sequence.
- By setting method='linear', you instruct Pandas to treat the values as equally spaced.
- It then fills the gaps by drawing a straight line between the known points. In this case, it calculates the values between 1 and 4.
- The two NaNs are replaced with 2 and 3, completing the arithmetic progression.
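One caveat worth knowing: by default, linear interpolation only fills gaps that sit between known points, so a leading NaN stays in place. The limit_direction parameter controls this. A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Default interpolation works forward only,
# so the leading NaN at index 0 is left untouched
forward_only = s.interpolate()

# limit_direction='both' also back-fills the leading gap
# with the nearest known value
both_ways = s.interpolate(limit_direction='both')
print(both_ways)
```

Check the ends of your series after interpolating; a surviving NaN at either edge usually means you need limit_direction='both' or a separate fill step.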
Using scikit-learn's SimpleImputer for missing values
import numpy as np
from sklearn.impute import SimpleImputer
import pandas as pd
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 5]])
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)
print(pd.DataFrame(imputed_data, columns=['A', 'B']))

Output:
A B
0 1.0 2.0
1 4.0 3.0
2 7.0 6.0
3 4.0 5.0
For machine learning tasks, scikit-learn’s SimpleImputer offers a robust way to handle missing data. It’s a preprocessing step that learns a strategy from your data and uses it to fill in the gaps.
- You first create an imputer instance, setting the strategy to 'mean' to calculate each column's average.
- The fit_transform() method then learns this mean from the non-missing values and replaces any NaNs with it.
- In the example, it replaces the missing values in the first column with 4.0, which is the average of 1 and 7.
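The strategy parameter accepts other options as well. A brief sketch using 'median', which holds up better when a column contains an outlier (the data array here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column A contains an outlier (100.0) that would distort the mean
data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [7.0, 6.0],
                 [100.0, 5.0]])

# The median of [1, 7, 100] is 7.0, which replaces the NaN;
# the mean would have been 36.0
median_imputer = SimpleImputer(strategy='median')
median_filled = median_imputer.fit_transform(data)
print(median_filled)
```

SimpleImputer also supports 'most_frequent' and 'constant' strategies, which extend the same pattern to categorical columns.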
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the data cleaning techniques from this article, Replit Agent can turn them into production applications:
- Build a data cleaning utility that processes uploaded datasets, letting you remove incomplete records with dropna() or replace gaps using fillna().
- Create a time-series dashboard that uses interpolate() to estimate and visualize missing sensor readings or financial data.
- Deploy a machine learning preprocessor that automatically handles missing features in a training dataset with scikit-learn's SimpleImputer.
Get started by describing your application idea. Replit Agent will write the code, test it, and fix issues for you automatically.
Common errors and challenges
Navigating missing data can introduce subtle bugs, but understanding these common challenges will help you avoid them.
Avoiding mistakes when comparing with np.nan values
A frequent mistake is trying to find missing values using a direct comparison like df['column'] == np.nan. This won't work because, by design, np.nan is not considered equal to anything, not even itself. This behavior ensures that missing values don't accidentally match in comparisons.
To correctly identify NaN values, you should always use functions built for this purpose. In Pandas, the isna() and notna() methods are your best tools. For NumPy arrays, use the np.isnan() function to create a reliable boolean mask.
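Before choosing a removal or fill method, it helps to know how much data is actually missing. Summing the boolean mask from isna() gives a per-column count. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})

# isna() returns a boolean mask; summing it counts
# True values (i.e. missing entries) per column
missing_counts = df.isna().sum()
print(missing_counts)
```

A column with only a handful of gaps is a good candidate for dropna() or fillna(), while one that is mostly missing may be better dropped entirely.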
Handling unexpected results with dropna() parameters
Sometimes, running dropna() can remove far more data than you intended, especially if missing values are scattered across many rows. The default behavior removes any row containing even a single NaN. You can control this with its parameters.
- Use how='all' to only drop rows or columns where every single value is missing.
- Use the thresh parameter to specify the minimum number of non-missing values a row or column must have to be kept. For example, thresh=3 keeps rows that have at least three valid data points.
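A short sketch of the thresh parameter in action, using an illustrative DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, np.nan],
    'B': [4, 5, np.nan],
    'C': [7, 8, np.nan],
})

# thresh=2 keeps rows with at least two non-missing values:
# row 0 has 3, row 1 has 2, row 2 has 0 (dropped)
kept = df.dropna(thresh=2)
print(kept)
```

Note that thresh counts the values that are present, not the values that are missing, which is a common point of confusion.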
Debugging type errors after filling missing values
After you use fillna(), you might run into a TypeError later in your analysis. This often happens because filling missing values can change a column's data type, or dtype. For instance, a column of integers with a single NaN is stored as a float type; filling it with 0 will still leave it as a float column.
If you fill with a non-numeric value like the string 'missing', the entire column's dtype will change to object, which will cause errors with mathematical operations. Always check your column types with df.dtypes after filling data. If needed, you can convert a column back to its correct type using the astype() method, like df['column'].astype('int64').
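A minimal sketch of checking and restoring a dtype after filling:

```python
import pandas as pd
import numpy as np

# The NaN forces the Series to be stored as float64
s = pd.Series([1, 2, np.nan, 4])

# Filling with 0 removes the NaN but the dtype stays float64
filled = s.fillna(0)

# Once no NaN remains, astype() can restore the integer type
restored = filled.astype('int64')
print(restored)
```

If you need to keep missing values in an integer column instead of filling them, Pandas' nullable 'Int64' dtype (capital I) is designed for exactly that case.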
Avoiding mistakes when comparing with np.nan values
You can't use the standard equality operator == to find np.nan values. This is because np.nan has a special property where it never equals itself. Attempting this comparison will return an empty result, as the following code demonstrates.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4]})
# Incorrect way to filter rows with NaN
filtered_df = df[df['A'] == np.nan]
print("Rows with NaN:", filtered_df)
Because np.nan isn't equal to anything, even itself, the expression df['A'] == np.nan returns False for every row. This leaves you with an empty DataFrame. The following code demonstrates the correct method for this check.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4]})
# Correct way to filter rows with NaN
filtered_df = df[df['A'].isna()]
print("Rows with NaN:", filtered_df)
The correct way to find missing values is with the isna() method. This function scans a Series and returns a boolean mask—True for every NaN and False otherwise. Using this mask as an index, as in df[df['A'].isna()], reliably selects only the rows with missing data. Always use this approach instead of direct comparison whenever you need to filter, count, or otherwise handle NaN values in your datasets to avoid empty results.
Handling unexpected results with dropna() parameters
The dropna() function can sometimes produce surprising results, leaving missing values in your dataset when you thought they'd be gone. This often happens when its parameters, like how, aren't configured for your specific goal. The code below shows this in action.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]
})
# This drops only rows where ALL values are NaN
cleaned_df = df.dropna(how='all')
print("After dropna(how='all'):\n", cleaned_df)
By setting how='all', you're telling dropna() to only remove rows where every value is missing. Since no row is completely empty, the function doesn't drop anything. The following code demonstrates the correct approach.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]
})
# This drops rows where ANY value is NaN
cleaned_df = df.dropna(how='any')
print("After dropna(how='any'):\n", cleaned_df)
The fix is to set the how parameter to 'any'. This instructs dropna() to remove a row if it contains even a single NaN value. It's the default setting and the most common use case for cleaning datasets. This approach is perfect when you need to work only with fully complete records. However, be careful, as it can aggressively remove data if missing values are scattered across many rows.
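When only certain columns matter for your analysis, the subset parameter restricts the check so rows aren't dropped over irrelevant gaps. A minimal sketch, reusing the same DataFrame shape:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
})

# Only rows where column 'A' is missing are dropped;
# NaNs in 'B' are ignored, so rows 0, 1, and 3 survive
cleaned = df.dropna(subset=['A'])
print(cleaned)
```

This is useful when one column is required for a calculation but others are optional.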
Debugging type errors after filling missing values
Filling missing values can sometimes lead to unexpected TypeError exceptions later in your code. This often happens because methods like fillna() can silently change a column's data type, or dtype, creating a mismatch with the operations you intend to perform.
For example, replacing a missing string with a number can convert the entire column, causing string methods to fail. The following code demonstrates how this can happen.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': ['A001', 'A002', 'A003', np.nan],
'value': [100, 200, 300, 400]
})
df['id'] = df['id'].fillna(0)
# Will cause error in string operations
result = df['id'].str.upper()
print(result)
The id column, which contains strings, is filled with the integer 0. When you apply a string method like str.upper(), it fails because the column now contains a number. The following code demonstrates the correct approach.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': ['A001', 'A002', 'A003', np.nan],
'value': [100, 200, 300, 400]
})
df['id'] = df['id'].fillna('UNKNOWN')
# Now string operations work properly
result = df['id'].str.upper()
print(result)
The fix is to fill the NaN with a value that matches the column's data type. Instead of using an integer like 0, the correct code uses the string 'UNKNOWN'. This keeps the id column's type consistent, allowing methods like str.upper() to run without a TypeError.
Always check your column's dtype after using fillna() to ensure it hasn't been unintentionally changed, especially before you perform type-specific operations.
Real-world applications
With these techniques and debugging strategies, you can confidently tackle real-world data challenges like cleaning sales reports and analyzing survey data.
Cleaning a sales dataset for reporting
To ensure your sales reports are accurate, you'll need to clean the raw data by using the fillna() method to handle missing values that would otherwise prevent calculations.
import pandas as pd
import numpy as np
# Sample sales dataset with missing values
sales = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=5),
'product': ['A', 'B', np.nan, 'A', 'C'],
'quantity': [10, np.nan, 15, 20, np.nan],
'price': [100, 200, np.nan, 100, 150]
})
# Clean the dataset for reporting
clean_sales = sales.copy()
clean_sales['product'] = clean_sales['product'].fillna('Unknown')
clean_sales[['quantity', 'price']] = clean_sales[['quantity', 'price']].fillna(0)
clean_sales['total'] = clean_sales['quantity'] * clean_sales['price']
print("Original sales data:\n", sales)
print("\nCleaned sales data for reporting:\n", clean_sales)
This code demonstrates a practical approach to cleaning a sales dataset. It first makes a copy of the sales DataFrame to keep the original data intact. Different cleaning strategies are then applied to different columns based on their data type.
- The categorical product column has its missing values replaced with the string 'Unknown'.
- The numerical quantity and price columns are filled with 0 using fillna().
After cleaning, a new total column is successfully calculated by multiplying the now-complete quantity and price columns.
Using multiple imputation techniques for survey data
Multiple imputation is particularly effective for survey data, as it fills gaps by modeling the relationships between variables to produce more accurate and realistic estimates.
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create a sample survey dataset with missing values
np.random.seed(42)
survey = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50, np.nan, 35],
'income': [50000, np.nan, 75000, 60000, np.nan, 80000, 65000],
'satisfaction': [7, 8, 6, np.nan, 9, 7, np.nan],
'years_customer': [2, 5, np.nan, 8, 10, 4, 6]
})
print("Original survey data with missing values:")
print(survey)
# Use MICE (Multiple Imputation by Chained Equations)
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
survey_mice = pd.DataFrame(
mice_imputer.fit_transform(survey),
columns=survey.columns
)
print("\nAfter MICE imputation:")
print(survey_mice.round(1))
This example uses scikit-learn's IterativeImputer to intelligently fill missing survey data. It works by treating each column with missing values as a prediction problem, using the other columns as features to estimate the missing entries.
- The fit_transform() method applies this modeling process to the entire dataset.
- It cycles through the columns multiple times, controlled by max_iter, to refine its estimates.
This creates a complete dataset where the filled values are contextually aware, based on patterns in the existing data.
Get started with Replit
Turn these techniques into a real tool. Tell Replit Agent: “Build a utility to clean CSVs using fillna()” or “Create a dashboard that uses interpolate() to fix time-series data.”
The agent writes the code, tests for errors, and deploys your application automatically. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.