How to merge CSV files in Python
Learn how to merge CSV files in Python. This guide covers various methods, tips, real-world uses, and common error debugging.

You can merge multiple CSV files in Python, a common task when you need to consolidate data. This process combines separate files into a unified dataset for simpler analysis and management.
In this article, we'll show you several techniques to combine your data effectively. You'll find practical tips, explore real-world applications, and get advice to debug the common errors that can occur.
Using pandas to concatenate CSV files
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
merged_df = pd.concat([df1, df2])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged CSV contains {len(merged_df)} rows")
--OUTPUT--
Merged CSV contains 300 rows
The pandas library simplifies this process with its pd.concat() function. This function takes a list of DataFrames—in this case, from your two CSV files—and stacks them vertically into a single DataFrame. It's an efficient way to combine datasets that share the same column structure.
When saving the result, using to_csv('merged_file.csv', index=False) is key. The index=False argument tells pandas not to write the DataFrame's index as a new column in your CSV file. This keeps your final dataset clean and avoids redundant data.
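If you'd rather have a clean, continuous row index in the combined DataFrame, pd.concat() also accepts ignore_index=True. A minimal, self-contained sketch (the small inline frames stand in for data read from your CSV files):

```python
import pandas as pd

# Two small frames standing in for data read from CSV files
df1 = pd.DataFrame({"product": ["A", "B"], "amount": [10, 20]})
df2 = pd.DataFrame({"product": ["C"], "amount": [30]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index
merged = pd.concat([df1, df2], ignore_index=True)
print(list(merged.index))  # [0, 1, 2]
```

Without ignore_index=True, the result would carry the indices 0, 1, 0 inherited from the two inputs, which can cause surprises when you later select rows by label.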
Basic CSV merging techniques
For situations where you don't need the overhead of pandas, Python's standard library provides simpler, more direct ways to combine your CSV data.
Merging CSV files with the csv module
import csv
with open('merged_file.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for filename in ['file1.csv', 'file2.csv']:
        with open(filename, 'r', newline='') as infile:
            reader = csv.reader(infile)
            if filename == 'file1.csv':  # Write header only once
                writer.writerows(reader)
            else:
                next(reader)  # Skip header in subsequent files
                writer.writerows(reader)
--OUTPUT--
# No console output, but creates a merged CSV file
This approach uses Python's built-in csv module for more granular control. The script iterates through your list of CSV files and writes their contents into a single output file.
- For the first file, it writes all rows, including the header.
- For all subsequent files, it uses `next(reader)` to skip the header row, preventing duplicates in the final merged file.
This method is efficient for simple merges and doesn't require any external libraries.
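If your files share the same columns but list them in a different order, csv.DictReader and csv.DictWriter can realign rows by header name rather than by position. A sketch, with two tiny sample files written on the fly purely for illustration:

```python
import csv

# Create two sample files whose columns appear in a different order (illustration only)
with open("a.csv", "w", newline="") as f:
    f.write("name,age\nAlice,30\n")
with open("b.csv", "w", newline="") as f:
    f.write("age,name\n25,Bob\n")

fieldnames = ["name", "age"]
with open("merged.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for path in ["a.csv", "b.csv"]:
        with open(path, newline="") as src:
            for row in csv.DictReader(src):
                writer.writerow(row)  # values are matched by column name, not position
```

Because each row travels as a dictionary keyed by header name, "25,Bob" from the second file lands in the output as "Bob,25", aligned with the first file's column order.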
Using glob to merge multiple CSV files
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
merged_df = pd.concat([pd.read_csv(file) for file in csv_files])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged {len(csv_files)} CSV files")
--OUTPUT--
Merged 5 CSV files
This approach is perfect for merging a large number of files without listing them manually. It combines the power of pandas with Python's built-in glob module to dynamically find and combine your data.
- The `glob.glob('data/*.csv')` function finds all file paths in the `data` directory that end with `.csv`. The asterisk (`*`) is a wildcard that matches any characters.
- This list of file paths is then used in a list comprehension to create a list of DataFrames, which `pd.concat()` merges into a single dataset.
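One caveat: glob.glob() returns paths in arbitrary order, so if merge order matters (daily reports, for example), sort the list first. Here is the same pattern sketched with pathlib and sorted(); the daily_csvs directory and file names are made up for the example:

```python
import pandas as pd
from pathlib import Path

# Write a couple of sample files so the sketch is self-contained (illustration only)
Path("daily_csvs").mkdir(exist_ok=True)
pd.DataFrame({"x": [1]}).to_csv("daily_csvs/day1.csv", index=False)
pd.DataFrame({"x": [2]}).to_csv("daily_csvs/day2.csv", index=False)

# sorted() makes the merge order deterministic
csv_files = sorted(Path("daily_csvs").glob("*.csv"))
merged_df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
print(merged_df["x"].tolist())  # [1, 2]
```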
Merging with standard file I/O operations
with open('file1.csv', 'r') as f1:
    header = f1.readline()
    file1_content = f1.read()
with open('merged_file.csv', 'w') as output:
    output.write(header)
    output.write(file1_content)
    with open('file2.csv', 'r') as f2:
        next(f2)  # Skip header line
        output.write(f2.read())
--OUTPUT--
# No console output, but creates a merged CSV file
This method offers a low-level approach, treating the CSV files as plain text. It's a straightforward way to stitch files together without any external libraries, giving you direct control over the process.
- The script first reads the header and content from `file1.csv` separately.
- It then writes this header and content into a new file, `merged_file.csv`.
- Finally, it opens `file2.csv`, skips its header using `next(f2)`, and appends the remaining content to the merged file.
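For very large inputs, the same text-level merge can stream data instead of reading whole files into memory, using shutil.copyfileobj. A sketch under the same assumption that every file shares a single header line (the sample files are created inline for illustration):

```python
import shutil

# Sample inputs (illustration only)
with open("file1.csv", "w") as f:
    f.write("id,value\n1,a\n")
with open("file2.csv", "w") as f:
    f.write("id,value\n2,b\n")

with open("merged_file.csv", "w") as out:
    for i, name in enumerate(["file1.csv", "file2.csv"]):
        with open(name) as src:
            if i > 0:
                next(src)  # skip the header in all but the first file
            shutil.copyfileobj(src, out)  # copy in buffered chunks, not all at once
```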
Advanced CSV merging methods
Beyond simple concatenation, you can tackle complex joins on key columns with merge(), process large files using chunksize, and even speed up the entire operation.
Joining CSV files with merge() on key columns
import pandas as pd
customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')
# Join customers with their orders
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
print(f"Merged data has {merged_data.shape[0]} rows and {merged_data.shape[1]} columns")
merged_data.to_csv('customer_orders.csv', index=False)
--OUTPUT--
Merged data has 120 rows and 8 columns
When you need to combine datasets that are related but not identical, pd.merge() is the right tool. Unlike concatenation, it performs a database-style join, linking rows from different files based on a shared key column. This is perfect for matching customer information with their corresponding orders.
- The `on='customer_id'` argument specifies the common column used to join the two DataFrames.
- `how='inner'` ensures that only records with a matching `customer_id` in both the `customers` and `orders` files are included in the result.
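The how argument decides what happens to rows without a match. This quick sketch on small inline frames (hypothetical customer data) compares inner and left joins:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "total": [10, 20, 5]})

# Inner join keeps only customers who have at least one order
inner = pd.merge(customers, orders, on="customer_id", how="inner")
# Left join keeps every customer; missing order fields become NaN
left = pd.merge(customers, orders, on="customer_id", how="left")

print(len(inner))  # 3 — customer 1 matches two orders, customer 3 matches one
print(len(left))   # 4 — customer 2 is kept, with NaN in the total column
```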
Processing large CSV files with chunksize
import pandas as pd
chunk_size = 10000
merged_chunks = pd.DataFrame()
for chunk in pd.read_csv('large_file1.csv', chunksize=chunk_size):
    merged_chunks = pd.concat([merged_chunks, chunk])
for chunk in pd.read_csv('large_file2.csv', chunksize=chunk_size):
    merged_chunks = pd.concat([merged_chunks, chunk])
print(f"Processed data in chunks, total rows: {len(merged_chunks)}")
merged_chunks.to_csv('merged_large_files.csv', index=False)
--OUTPUT--
Processed data in chunks, total rows: 1250000
When working with massive CSV files, loading everything into memory at once can crash your program. The chunksize parameter in pd.read_csv() offers a memory-efficient solution. It processes the file in smaller, manageable pieces instead of all at once.
- The code iterates through each file, reading it in chunks of a specified size (10,000 rows in this case).
- Each chunk is appended to a growing DataFrame using `pd.concat()`, gradually building the complete merged dataset. Note that the final DataFrame still holds everything in memory; chunking only caps how much is read at a time.
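If the goal is a flat memory profile, each chunk can instead be written straight to the output file in append mode, so nothing accumulates in memory. A sketch (with tiny sample files standing in for genuinely large ones):

```python
import pandas as pd

# Small sample inputs, standing in for genuinely large files (illustration only)
pd.DataFrame({"v": range(5)}).to_csv("large_file1.csv", index=False)
pd.DataFrame({"v": range(5, 8)}).to_csv("large_file2.csv", index=False)

first = True
for name in ["large_file1.csv", "large_file2.csv"]:
    for chunk in pd.read_csv(name, chunksize=2):
        # Write the header only once, then append subsequent chunks
        chunk.to_csv("merged_large_files.csv",
                     mode="w" if first else "a",
                     header=first, index=False)
        first = False
```

Each chunk is flushed to disk as soon as it is read, so peak memory usage stays near one chunk's worth of rows regardless of the total file size.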
Using multiprocessing for faster CSV merging
import pandas as pd
from multiprocessing import Pool
import glob

def process_file(filename):
    return pd.read_csv(filename)

if __name__ == '__main__':  # Guard required so worker processes don't re-run this block
    csv_files = glob.glob('data/*.csv')
    with Pool(processes=4) as pool:
        dfs = pool.map(process_file, csv_files)
    merged_df = pd.concat(dfs)
    print(f"Parallel processing complete. Merged {len(csv_files)} files with {len(merged_df)} total rows")
    merged_df.to_csv('parallel_merged.csv', index=False)
--OUTPUT--
Parallel processing complete. Merged 5 files with 500000 total rows
To speed up merging, you can use Python's multiprocessing module. Because CSV parsing is CPU-intensive and Python threads are constrained by the GIL, separate processes can genuinely parse several files at once. This code creates a pool of worker processes that read and parse multiple CSV files in parallel, which can be much faster than processing them one by one.
- The `Pool(processes=4)` line sets up four separate processes to work in parallel.
- The `pool.map()` function then assigns each process a file to read using the `process_file` function.
Once all processes finish, pd.concat() assembles the resulting DataFrames into a single dataset. This parallel approach can significantly cut down the total time needed for the merge.
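If spawning processes is inconvenient (it requires an `if __name__ == '__main__'` guard on platforms that use the spawn start method, and works poorly in notebooks), a thread pool from concurrent.futures is a lighter-weight alternative, though the speedup is usually smaller because CSV parsing holds the GIL for part of the work. A sketch with made-up sample files:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Sample files (illustration only)
pd.DataFrame({"n": [1]}).to_csv("part1.csv", index=False)
pd.DataFrame({"n": [2]}).to_csv("part2.csv", index=False)

files = ["part1.csv", "part2.csv"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so the merge order stays deterministic
    dfs = list(pool.map(pd.read_csv, files))
merged_df = pd.concat(dfs, ignore_index=True)
print(merged_df["n"].tolist())  # [1, 2]
```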
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. There's no need to configure environments or install packages.
Instead of piecing together individual functions, you can use Agent 4 to build a complete application. Describe the tool you want, and the Agent will take it from a concept to a working product. For example, you could build:
- A sales dashboard that automatically merges daily CSV reports from different store locations.
- A data enrichment tool that joins new customer signups with marketing campaign data using a shared `customer_id` column.
- A log processor that consolidates and cleans fragmented server logs from multiple CSV files into a single, unified dataset for analysis.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Merging CSV files can introduce unexpected issues, but most common errors are straightforward to diagnose and resolve with the right approach.
When you use pd.concat(), you might run into trouble if your CSV files don't have the exact same columns. This happens if column names differ or if one file has extra columns that others lack.
By default, pandas performs an outer join, keeping all columns from all files and filling missing values with NaN (Not a Number). This can complicate your analysis. To manage this, you can:
- Use `join='inner'` within `pd.concat()` to keep only the columns that are present in every file.
- Stick with the default behavior and clean up the resulting DataFrame by filling or dropping the `NaN` values afterward.
Another common hiccup is when the same column has different data types across files—for example, a user_id column stored as a number in one CSV and as text in another.
When pandas merges inconsistent types, it often defaults to the flexible but less efficient object data type, which can prevent you from performing mathematical operations. The best fix is to enforce consistency by using the dtype parameter in pd.read_csv() to specify the correct data type from the start.
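For example, forcing an ID column to be read as text in every file keeps the types consistent from the first read. A self-contained sketch (file names and values are hypothetical):

```python
import pandas as pd

# Two sample files where user_id would otherwise be inferred as different types
with open("users_a.csv", "w") as f:
    f.write("user_id,score\n001,10\n")
with open("users_b.csv", "w") as f:
    f.write("user_id,score\nA17,20\n")

# dtype forces user_id to str in both reads, so the merged column is uniform
df_a = pd.read_csv("users_a.csv", dtype={"user_id": str})
df_b = pd.read_csv("users_b.csv", dtype={"user_id": str})
merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged["user_id"].tolist())  # ['001', 'A17']
```

As a bonus, reading the column as text preserves leading zeros ("001" stays "001"), which pandas would otherwise strip when inferring an integer type.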
A UnicodeDecodeError is a sign that your CSV files are saved with different text encodings. This error means Python can't interpret a file's byte sequence because it's using the wrong character map, a common issue when files come from different operating systems or programs.
To resolve this, you need to identify each file's encoding and specify it with the encoding parameter in pd.read_csv(). While UTF-8 is a common standard, you may need to test other formats like latin1 or cp1252 to find the one that works.
Handling mismatched column errors when using pd.concat()
Using pd.concat() on files with different columns results in a DataFrame containing all columns from every file. Pandas fills the missing cells with NaN values, which can create a messy dataset that's difficult to analyze without further cleaning.
The code below demonstrates this by merging two files with different schemas and then counting the resulting null values.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# This will concatenate but keep all columns, creating NaN values
merged_df = pd.concat([df1, df2])
print(merged_df.isnull().sum())
Since the two files have different columns, pd.concat() fills the missing data with NaN values, creating an untidy result. The following code demonstrates how to manage this for a cleaner merge.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# Use only columns that are in both dataframes
common_columns = list(set(df1.columns).intersection(set(df2.columns)))
merged_df = pd.concat([df1[common_columns], df2[common_columns]])
print(f"Merged on common columns: {common_columns}")
This solution isolates the columns that both DataFrames share. It converts the column names into sets and finds their intersection(). By selecting only these common columns before concatenating, you ensure the final DataFrame is clean and free of NaN values. This is a great way to handle files from different sources where you only care about the overlapping data.
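pd.concat() can perform the same column intersection in one step via its join parameter; a minimal sketch with inline frames mirroring the sales/inventory example:

```python
import pandas as pd

df1 = pd.DataFrame({"date": ["2024-01-01"], "product": ["A"], "amount": [5]})
df2 = pd.DataFrame({"product": ["B"], "quantity": [3], "location": ["NY"]})

# join='inner' keeps only the columns present in every input frame
merged = pd.concat([df1, df2], join="inner", ignore_index=True)
print(list(merged.columns))  # ['product']
```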
Dealing with inconsistent data types between CSV files
A frustrating issue occurs when a column has inconsistent data types across files, like 'revenue' stored as text in one and a number in another. While pd.concat() may succeed, subsequent calculations will fail. The code below shows this in action.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# This will concatenate but may cause issues with operations later
merged_df = pd.concat([df1, df2])
# This will raise an error due to inconsistent types
total_revenue = merged_df['revenue'].sum()
The sum() operation fails because you can't add a string value to a number. This type mismatch prevents mathematical calculations on the merged column. The code below shows how to fix this by standardizing the data first.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# Convert string to float in the first dataframe
df1['revenue'] = df1['revenue'].str.replace(',', '').astype(float)
# Now concatenation will work properly
merged_df = pd.concat([df1, df2])
total_revenue = merged_df['revenue'].sum()
print(f"Total revenue: ${total_revenue:.2f}")
This solution standardizes the data before merging. It targets the problematic 'revenue' column, first using .str.replace(',', '') to remove commas and then .astype(float) to convert the text to a number. By cleaning the data first, you ensure that pd.concat() creates a column with a consistent numeric type. This allows mathematical operations like sum() to execute without errors. It's a common issue when data comes from different sources or is entered manually.
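When you can't guarantee that every value converts cleanly, pd.to_numeric() with errors='coerce' turns unparseable entries into NaN instead of raising. A short sketch (the sample values are hypothetical):

```python
import pandas as pd

# Mixed-quality revenue strings, as they might arrive from manual entry
revenue = pd.Series(["1000.50", "2,500.00", "n/a"])

# Strip thousands separators, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(revenue.str.replace(",", ""), errors="coerce")
print(cleaned.tolist())  # [1000.5, 2500.0, nan]
```

You can then decide explicitly how to handle the NaN entries (drop them, fill them, or flag them for review) rather than having a single bad value crash the whole merge.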
Resolving encoding issues when merging CSV files
A UnicodeDecodeError often appears when merging files from different systems, as they may use incompatible text encodings. This error halts your script because Python can't interpret the file's byte sequence. The code below triggers this exact issue when reading two files.
import pandas as pd
# Trying to merge files with different encodings
try:
    df1 = pd.read_csv('european_sales.csv')  # UTF-8 encoding
    df2 = pd.read_csv('asian_sales.csv')  # May have a different encoding
    merged_df = pd.concat([df1, df2])
    print(f"Merged dataframe has {len(merged_df)} rows")
except UnicodeDecodeError:
    print("Error: Could not decode file with the default encoding")
The try...except block catches the error because pd.read_csv() assumes a default encoding that doesn't match one of the files. This mismatch stops the script. The following code shows how to fix this during the read process.
import pandas as pd
# Explicitly specify encodings for each file
df1 = pd.read_csv('european_sales.csv', encoding='utf-8')
df2 = pd.read_csv('asian_sales.csv', encoding='iso-8859-1')
# Now concatenation works with the proper encodings
merged_df = pd.concat([df1, df2])
print(f"Successfully merged {len(df1)} and {len(df2)} rows")
This solution directly addresses the encoding mismatch by specifying the correct format for each file. By using the encoding parameter in pd.read_csv(), you tell pandas exactly how to interpret the files, preventing the UnicodeDecodeError. For example, one file might use 'utf-8' while another needs 'iso-8859-1'. This is a common problem when you're working with data from different geographic regions or legacy systems, so it's a good practice to check encodings upfront.
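When you don't know a file's encoding up front, one pragmatic pattern is to try a short list of candidates and keep the first that decodes cleanly. A sketch of such a helper; the candidate list and sample file are assumptions to adjust for your own data sources:

```python
import pandas as pd

# Create a sample file saved in latin-1 (illustration only)
with open("legacy.csv", "wb") as f:
    f.write("city,value\nMálaga,1\n".encode("latin-1"))

def read_csv_any(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each candidate encoding and return the first successful read."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # wrong encoding; try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")

df = read_csv_any("legacy.csv")
print(df.loc[0, "city"])  # Málaga
```

Keep in mind that some wrong encodings decode without errors but produce garbled characters, so it's still worth spot-checking the result.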
Real-world applications
Now that you know how to troubleshoot common issues, you can apply these merging techniques to solve real-world data problems.
Cleaning data while merging CSV files
Merging files is often the perfect time to clean your data, letting you handle duplicates, missing values, and incorrect types all in one go.
import pandas as pd
# Read files and merge
df1 = pd.read_csv('sales_q1.csv')
df2 = pd.read_csv('sales_q2.csv')
merged_df = pd.concat([df1, df2])
# Clean the merged data
merged_df = merged_df.drop_duplicates()
merged_df = merged_df.dropna(subset=['order_id', 'amount'])
merged_df['amount'] = merged_df['amount'].astype(float)
print(f"Clean merged data: {len(merged_df)} rows")
This code demonstrates a practical workflow for merging and cleaning data simultaneously. After combining two CSV files with pd.concat(), it applies three key cleaning steps to the resulting DataFrame.
- First, `drop_duplicates()` removes any redundant rows.
- Next, `dropna()` discards records that lack essential data in the `order_id` or `amount` columns.
- Finally, `astype(float)` standardizes the `amount` column to a numeric format, enabling accurate mathematical operations.
Creating a dashboard from multiple CSV data sources
You can combine data from different business areas, like sales and inventory, to build a dashboard that visualizes key metrics.
The code below joins these two datasets using pd.merge() on a shared product_id. It then calculates the total revenue for each product, finds the top five, and uses matplotlib to plot the results in a bar chart, creating a simple but effective sales dashboard.
import pandas as pd
import matplotlib.pyplot as plt
# Load data from different departments
sales = pd.read_csv('sales.csv')
inventory = pd.read_csv('inventory.csv')
# Merge on product_id
combined = pd.merge(sales, inventory, on='product_id')
# Create a simple dashboard
top_products = combined.groupby('product_name')['revenue'].sum().nlargest(5)
top_products.plot(kind='bar')
plt.title('Top 5 Products by Revenue')
plt.savefig('sales_dashboard.png')
print(f"Dashboard created with data from {len(sales)} sales and {len(inventory)} inventory records")
This code demonstrates how to synthesize insights from different data sources. It starts by enriching sales records with inventory information using pd.merge() on the common product_id. The script then pivots from raw data to analysis.
- It uses `groupby()` to aggregate sales figures for each unique product.
- The `nlargest(5)` method isolates the top performers.
- Finally, `plot()` and `savefig()` create and save a visual report, making the findings easy to share.
Get started with Replit
Now, turn this knowledge into a tool. Tell Replit Agent to "build a dashboard that merges daily sales CSVs" or "create a utility to clean and combine log files from different servers."
The Agent writes the code, tests for errors, and deploys your app. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
Create & deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
