How to merge dataframes in Python

Learn to merge dataframes in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

Published on: Tue, Mar 3, 2026
Updated on: Wed, Apr 1, 2026
The Replit Team

Merging dataframes is a fundamental task in Python data analysis. The pandas library provides powerful functions, such as merge(), to combine datasets with precision and control.

In this article, we'll explore various merge techniques and practical tips. You'll also find real-world applications and common debugging advice to help you master dataframe combinations for your projects.

Using pd.merge() to join dataframes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

Output:

  key  value1  value2
0   A       1       4
1   B       2       5

The pd.merge() function combines df1 and df2 by matching values in the column specified by the on parameter—in this case, 'key'.

The function's default behavior is to perform an inner join. This means it only includes rows where the key exists in both dataframes, which is why the output shows:

  • Rows with keys 'A' and 'B' are kept because they appear in both.
  • The rows with keys 'C' and 'D' are dropped because they are unique to one of the dataframes.

Common methods for dataframe joining

In addition to the flexible pd.merge() function, pandas provides other methods like DataFrame.join() and pd.concat() for more specialized combination tasks.

Using DataFrame.join() for index-based merging

import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])
joined_df = df1.join(df2, how='inner')
print(joined_df)

Output:

   value1  value2
A       1       4
B       2       5

The DataFrame.join() method offers a straightforward way to combine dataframes based on their indices. It's a convenient alternative to pd.merge() when your data is already aligned by index labels instead of columns.

In this example, df1.join(df2) merges the two using their shared index values. The behavior is controlled by the how parameter:

  • With how='inner', the join keeps only index labels that appear in both dataframes—in this case, 'A' and 'B'.
  • Index labels 'C' and 'D' are dropped because they aren't shared.

Using pd.concat() to combine dataframes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})
concat_rows = pd.concat([df1, df2])
concat_cols = pd.concat([df1, df2], axis=1)
print("Rows:", concat_rows.shape[0], "Columns:", concat_cols.shape[1])
print(concat_rows)

Output:

Rows: 4 Columns: 4
  key  value
0   A      1
1   B      2
0   C      3
1   D      4

The pd.concat() function is your go-to for stacking dataframes either vertically or horizontally. It doesn't align data based on shared values—it simply appends them. This makes it perfect for straightforward combination tasks.

  • The default behavior, axis=0, stacks dataframes on top of each other, which is how concat_rows is formed.
  • When you set axis=1, it combines them side-by-side, creating new columns.
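One thing to watch: stacking with the defaults preserves each dataframe's original index, which is why the row labels 0, 1, 0, 1 repeat in the output above. If you'd rather have a fresh sequential index, pass ignore_index=True — a small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})

# ignore_index=True discards the original row labels and renumbers from 0
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```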

Using SQL-style join operations with pd.merge()

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})
inner_join = pd.merge(df1, df2, on='key', how='inner')
left_join = pd.merge(df1, df2, on='key', how='left')
right_join = pd.merge(df1, df2, on='key', how='right')
print(f"Inner: {len(inner_join)}, Left: {len(left_join)}, Right: {len(right_join)}")

Output:

Inner: 1, Left: 3, Right: 3

The pd.merge() function lets you replicate SQL-style joins using the how parameter. This gives you fine-grained control over which rows are included in the final combined dataset.

  • how='inner': It's the default behavior. This join keeps only rows where the key exists in both dataframes, which is why the result has just one entry for key 'A'.
  • how='left': This keeps all keys from the left dataframe (df1) and merges matching data from the right.
  • how='right': This does the opposite, keeping all keys from the right dataframe (df2) and merging from the left.
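The how parameter also accepts 'outer', which isn't shown in the snippet above. An outer join keeps the union of keys from both sides and fills unmatched cells with NaN. A quick sketch using the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})

# how='outer' keeps the union of keys: A, B, C, D, E
outer_join = pd.merge(df1, df2, on='key', how='outer')
print(len(outer_join))  # 5 rows; unmatched sides become NaN
```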

Advanced dataframe merging techniques

You can tackle more complex merges by using multiple columns as keys, managing name conflicts with suffixes, and verifying the join with indicator and validate.

Merging on multiple columns with compound keys

import pandas as pd

df1 = pd.DataFrame({
    'key1': ['A', 'A', 'B'],
    'key2': [1, 2, 1],
    'value1': [100, 200, 300]
})
df2 = pd.DataFrame({
    'key1': ['A', 'A', 'B'],
    'key2': [1, 2, 2],
    'value2': [400, 500, 600]
})
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])
print(merged_df)

Output:

  key1  key2  value1  value2
0    A     1     100     400
1    A     2     200     500

When your data's uniqueness depends on more than one column, you can merge using a compound key. Simply pass a list of column names to the on parameter, and pandas will require a match across all of them.

  • In this case, a merge only occurs if a row's key1 and key2 values are identical in both dataframes.
  • This is why the pairs ('A', 1) and ('A', 2) are kept—they appear in both.
  • The combination ('B', 1) is dropped because it doesn't have a complete match in df2.

Handling duplicate column names with suffixes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2], 'data': [10, 20]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value': [3, 4], 'data': [30, 40]})
merged_df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(merged_df)

Output:

  key  value_left  data_left  value_right  data_right
0   A           1         10            3          30

When you merge dataframes that share column names—like value and data in this case—pandas needs a way to tell them apart. The suffixes parameter elegantly handles this by preventing naming conflicts in the final output.

  • You provide a tuple of strings, such as ('_left', '_right'), which pandas appends to the overlapping column names.
  • The first suffix is applied to columns from the left dataframe (df1), and the second is applied to those from the right (df2).

This simple step keeps your merged data organized and traceable to its source.

Using indicator and validate parameters for advanced merging

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged_df = pd.merge(
    df1, df2, on='key', how='outer',
    indicator=True, validate='1:1'
)
print(merged_df)

Output:

  key  value_x  value_y      _merge
0   A      1.0      4.0        both
1   B      2.0      5.0        both
2   C      3.0      NaN   left_only
3   D      NaN      6.0  right_only

The indicator and validate parameters give you powerful tools for quality control during a merge. They help you understand the merge's outcome and ensure your data is clean.

  • Setting indicator=True adds a _merge column that explicitly shows the source of each row—whether it came from the left_only, right_only, or both dataframes.
  • The validate='1:1' argument acts as a data integrity check. It ensures that merge keys are unique in both dataframes, raising an error if they aren't and preventing unexpected duplicate rows.
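To see the validate check in action, here's a small sketch with a deliberately duplicated key (an assumption added for illustration): pandas raises a MergeError instead of silently multiplying rows.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'A', 'B'], 'value': [1, 2, 3]})  # 'A' appears twice
right = pd.DataFrame({'key': ['A', 'B'], 'value': [4, 5]})

try:
    pd.merge(left, right, on='key', validate='1:1')
except pd.errors.MergeError as err:
    # The duplicate 'A' in `left` violates the one-to-one guarantee
    print("Validation failed:", err)
```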

Move faster with Replit

Mastering individual functions like pd.merge() is one thing, but building a complete application is another. Replit is an AI-powered development platform where all Python dependencies come pre-installed, so you can skip the setup and start coding instantly.

Instead of just piecing together techniques, you can use Agent 4 to build a complete, working product directly from a description. It handles the coding, database connections, APIs, and deployment for you. For example, you could build:

  • A customer data tool that merges sales records with support tickets to create a unified user profile.
  • An inventory dashboard that uses a left join to combine a product catalog with sales data, highlighting items that haven't sold.
  • A reporting utility that concatenates monthly financial dataframes into a single, comprehensive annual report.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Merging dataframes isn't always straightforward; you might run into issues with mismatched data types, case-sensitive keys, or unexpected duplicates.

Troubleshooting merge on object errors with mismatched data types

A common roadblock is an error message indicating you're trying to merge on mismatched data types, like an object and an integer. This happens because pandas can't match a string like '123' with a number like 123—it sees them as entirely different values.

To fix this, you first need to inspect the data types of your key columns using df.dtypes. Once you've identified the mismatch, you can use the astype() method to convert one of the columns, ensuring both keys share the same data type before you attempt the merge again.

Debugging missing data after merging with case-sensitive keys

You might notice that rows are missing from your merged output even when the keys appear to be identical. This is often because pandas merges are case-sensitive by default, meaning a key like 'UserA' in one dataframe will not match 'usera' in another.

The easiest way to resolve this is to standardize the case of your key columns before merging. Applying a method like .str.lower() to both key columns ensures that capitalization differences don't cause you to lose data during the join.

Fixing duplicate values in merge keys with drop_duplicates()

If your merged dataframe has an unexpectedly high number of rows, the culprit is likely duplicate values in your merge keys. When a key appears multiple times, pandas creates a Cartesian product, matching every instance from the left dataframe with every corresponding instance from the right.

You can prevent this by cleaning your dataframes beforehand. Use the drop_duplicates() method on your dataframes, specifying the key column with the subset parameter. This ensures each merge key is unique, giving you a clean, one-to-one or many-to-one merge without unintended row multiplication.

Troubleshooting merge on object errors with mismatched data types

You might get an empty result when merging, even if the keys seem to match. This is a classic sign of mismatched data types. For example, pandas won't match the integer 1 with the string '1'. See what happens below.

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})

merged_df = pd.merge(df1, df2, on='key')
print(merged_df) # Empty dataframe

The key column in df1 holds integers, while df2 uses strings. Because pandas treats these types as distinct, the inner join finds no matches and returns an empty dataframe. The fix involves aligning these types, as shown below.

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})

df2['key'] = df2['key'].astype(int)
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

The fix is to align the data types before the merge. By applying df2['key'].astype(int), you convert the string column to integers, making it compatible with the key in df1. With matching types, pandas can successfully join the rows. This issue often arises when combining data from different sources, such as CSV files where numbers might be imported as text. A quick check with .dtypes can save you from this headache.

Debugging missing data after merging with case-sensitive keys

You might find your merged dataframe is missing rows you expected to see. This often happens because pandas treats keys with different capitalization as completely separate values. For example, 'C001' won't match 'c001'. See what happens in this example.

import pandas as pd

customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003'],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['c001', 'c002', 'c004'],  # Note lowercase
    'amount': [100, 200, 150]
})

merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 0 records

The merge returns zero records because pandas is case-sensitive by default and can't match customer_id values like 'C001' with 'c001'. See how a small adjustment before the merge operation can fix this.

import pandas as pd

customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003'],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['c001', 'c002', 'c004'],  # Note lowercase
    'amount': [100, 200, 150]
})

customers['customer_id'] = customers['customer_id'].str.upper()
orders['customer_id'] = orders['customer_id'].str.upper()

merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 2 records

The solution is to standardize the case before merging. By applying .str.upper() to the customer_id column in both dataframes, you ensure that keys like 'C001' and 'c001' are treated as identical. This simple preprocessing step prevents data loss from case mismatches. This issue is common when combining data from different systems where string formatting isn't consistent, so it's a good habit to check your keys beforehand.

Fixing duplicate values in merge keys with drop_duplicates()

If your merge results in far more rows than you started with, you've likely run into an issue with duplicate keys. This happens when a key value isn't unique, causing pandas to generate extra rows for each match. The code below demonstrates this problem.

import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 101],  # Duplicate product_id
    'product_name': ['Laptop', 'Phone', 'Tablet']
})

inventory = pd.DataFrame({
    'product_id': [101, 102, 103],
    'quantity': [5, 10, 8]
})

result = pd.merge(products, inventory, on='product_id')
print(result) # Shows duplicate rows for product_id 101

The products dataframe contains two entries for product_id 101. Pandas matches both of these with the single 101 entry in inventory, creating unwanted duplicate rows. The following code shows how to prevent this.

import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 101],  # Duplicate product_id
    'product_name': ['Laptop', 'Phone', 'Tablet']
})

inventory = pd.DataFrame({
    'product_id': [101, 102, 103],
    'quantity': [5, 10, 8]
})

products_unique = products.drop_duplicates(subset=['product_id'])
result = pd.merge(products_unique, inventory, on='product_id')
print(result) # Clean result without duplicates

The solution is to clean your data before the merge. By applying drop_duplicates(subset=['product_id']), you ensure each key is unique in the products dataframe. This prevents pandas from creating a Cartesian product that multiplies rows, giving you a clean merge without unintended duplicates. This is a crucial step when dealing with data from sources that may contain repeated entries, like logs or user-generated content, so it's always a good idea to validate your keys first.

Real-world applications

Beyond debugging, these merge techniques are essential for solving real-world data problems, from analyzing customer purchases to enriching geographic data.

Analyzing customer purchases with pd.merge()

By merging sales and customer dataframes with pd.merge(), you can easily calculate total purchase amounts for different customer segments.

import pandas as pd

sales_df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 104],
    'amount': [150.50, 200.75, 50.25, 300.00, 120.00]
})

customer_df = pd.DataFrame({
    'customer_id': [101, 102, 103, 105],
    'segment': ['Premium', 'Standard', 'Premium', 'Standard']
})

# Join sales with customer information
result = pd.merge(sales_df, customer_df, on='customer_id')
print(result.groupby('segment')['amount'].sum())

This code demonstrates a common pattern for enriching and summarizing data. It combines information from two separate dataframes to create a concise report.

  • The pd.merge() function first joins the sales and customer dataframes on their shared customer_id. This step attaches the correct customer segment to each sales record.
  • After merging, the chained groupby('segment')['amount'].sum() command reorganizes the data by segment and then calculates the total amount for each group, producing an aggregated summary.

Enriching geographic data with how='left' join for analysis

Using a how='left' join allows you to augment a primary dataset, like a list of cities, with related data from another source, such as state economic indicators, while ensuring every city remains in your final analysis.

import pandas as pd

cities_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'state': ['NY', 'CA', 'IL', 'TX', 'AZ'],
    'population': [8.4, 4.0, 2.7, 2.3, 1.7]  # in millions
})

metrics_df = pd.DataFrame({
    'state': ['NY', 'CA', 'IL', 'TX', 'WA'],
    'gdp_growth': [2.1, 3.5, 1.8, 4.2, 3.9],
    'unemployment': [4.2, 4.1, 4.7, 3.8, 3.6]
})

# Enrich city data with state metrics using left join
enriched_df = pd.merge(cities_df, metrics_df, on='state', how='left')
enriched_df['econ_index'] = enriched_df['gdp_growth'] - enriched_df['unemployment']
print(enriched_df[['city', 'state', 'econ_index']].sort_values('econ_index', ascending=False))

This example enriches city data with state-level economic metrics. It uses pd.merge() with how='left' to ensure all original cities are kept in the final dataframe, even if their state lacks corresponding economic data.

  • A new econ_index column is calculated by subtracting unemployment from GDP growth, creating a simple economic score.
  • The final output is then sorted by this index, ranking the cities based on their state's economic health.

Get started with Replit

Turn your knowledge into a real tool with Replit Agent. Describe what you want, like "a script that merges user signups with purchase history" or "a tool that joins product specs with live inventory data."

The Agent handles the coding, testing, and deployment, turning your description into a finished app. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.

Get started for free

Create & deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.