How to read multiple CSV files in Python
Learn how to read multiple CSV files in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

To analyze large datasets, you often need to read multiple CSV files in Python. This common data science task is simple with the right libraries and techniques for efficient data aggregation.
In this article, you'll explore various techniques for managing this process, along with practical tips, real-world applications, and debugging advice to help you select the best approach for your specific needs.
Using a basic loop to read multiple CSV files
import pandas as pd

csv_files = ['file1.csv', 'file2.csv', 'file3.csv']
dataframes = []
for file in csv_files:
    df = pd.read_csv(file)
    dataframes.append(df)

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
The most direct way to handle multiple files is to loop through them. This code iterates over a list of filenames, using the pandas read_csv() function to load each one into a DataFrame. It's a simple and effective method when you know exactly which files you need to process.
Each DataFrame is then added to a list called dataframes. This step creates a collection of separate datasets, preparing them for the next stage—typically combining them into a single, larger DataFrame for analysis.
Efficient file selection and combination
Instead of manually listing every file, you can automate file discovery and combine your datasets more efficiently with a few powerful techniques.
Using glob to find CSV files by pattern
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
dataframes = []
for file in csv_files:
    df = pd.read_csv(file)
    dataframes.append(df)
print(f"Found files: {csv_files}")
print(f"{len(dataframes)} dataframes have been read")

Output:
Found files: ['data/file1.csv', 'data/file2.csv', 'data/file3.csv']
3 dataframes have been read
The glob module offers a more dynamic way to find files. Instead of hardcoding filenames, you can use the glob.glob() function to locate all files matching a specific pattern. It's especially useful when your directory contains many CSV files.
- The argument 'data/*.csv' instructs Python to search within the data directory.
- The * wildcard matches any filename, as long as it ends with .csv.
This method automatically creates a list of file paths, making your code cleaner and more adaptable to changes in your dataset.
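If your CSV files live in nested subdirectories, glob can also search recursively with the ** wildcard, and pathlib offers an equivalent rglob() method. A minimal sketch; the directory tree here is generated on the fly purely for illustration:

```python
import glob
import os
import tempfile
from pathlib import Path

# Build a small directory tree with nested CSVs for demonstration
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2024", "q1"))
for rel in ["a.csv", os.path.join("2024", "b.csv"), os.path.join("2024", "q1", "c.csv")]:
    Path(root, rel).write_text("id,value\n1,10\n")

# The '**' pattern with recursive=True descends into subdirectories
found = sorted(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))
print(len(found))  # 3: files at every level are matched

# pathlib equivalent: rglob() always searches recursively
found_pathlib = sorted(str(p) for p in Path(root).rglob("*.csv"))
print(found_pathlib == found)  # True
```

By contrast, the non-recursive pattern 'data/*.csv' only matches files sitting directly inside the data directory.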
Using list comprehension for cleaner code
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
dataframes = [pd.read_csv(file) for file in csv_files]

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
List comprehension offers a more concise way to build your list of DataFrames. The line dataframes = [pd.read_csv(file) for file in csv_files] condenses the for loop and list creation into a single, readable statement. It’s a Pythonic approach that many developers prefer for its elegance.
- This method achieves the same result as the standard loop but with less code.
- The logic reads like plain English: create a list of DataFrames by reading each file in the csv_files list.
Combining multiple CSV files with pd.concat()
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
combined_df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)
print(f"Combined shape: {combined_df.shape}")

Output:
Combined shape: (300, 5)
Once you have a list of DataFrames, the final step is to merge them. The pd.concat() function handles this by taking your list of DataFrames and stacking them vertically into a single, unified dataset.
- This approach efficiently combines the file reading and concatenation into one line.
- The ignore_index=True argument is key. It creates a new, clean index for the final DataFrame, which prevents duplicate index values from the original files.
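A common extension of this pattern, sketched below, is tagging each row with the file it came from using DataFrame.assign() before concatenating. The tiny CSVs and the source_file column name are illustrative, not part of the original example:

```python
import os
import tempfile

import pandas as pd

# Create two tiny CSVs to stand in for real data files
tmp = tempfile.mkdtemp()
samples = {"jan.csv": "id,sales\n1,100\n2,200\n", "feb.csv": "id,sales\n3,300\n"}
for name, text in samples.items():
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

csv_files = sorted(os.path.join(tmp, name) for name in samples)

# assign() tags every row with its source before the DataFrames are stacked
combined = pd.concat(
    [pd.read_csv(f).assign(source_file=os.path.basename(f)) for f in csv_files],
    ignore_index=True,
)
print(combined.shape)                     # (3, 3): id, sales, source_file
print(combined["source_file"].nunique())  # 2 distinct source files
```

Keeping the source filename makes it much easier to trace bad rows back to the file that produced them.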
Advanced CSV processing techniques
While the methods above work well for many datasets, you'll need more powerful techniques to handle the performance and memory demands of large-scale CSV processing.
Processing CSV files in parallel with concurrent.futures
import pandas as pd
import glob
from concurrent.futures import ThreadPoolExecutor
csv_files = glob.glob('data/*.csv')
with ThreadPoolExecutor(max_workers=4) as executor:
    dataframes = list(executor.map(pd.read_csv, csv_files))

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
For large numbers of files, sequential reading can become a bottleneck. The concurrent.futures module helps by processing files in parallel. Using ThreadPoolExecutor creates a pool of worker threads, allowing you to read multiple CSVs simultaneously instead of one after another.
- The ThreadPoolExecutor is initialized with a set number of max_workers; in this case, four threads that can run tasks concurrently.
- The executor.map() function applies pd.read_csv to every file in your list, distributing the work across the available threads for a significant speed boost.
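One caveat: executor.map() re-raises the first exception it encounters, so a single unreadable file aborts the whole batch. A small wrapper can skip bad files instead. The sketch below generates its own sample files, and the read_or_none helper is a name made up for this example:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# One real file and one deliberately missing path
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.csv")
with open(good, "w") as f:
    f.write("id,value\n1,10\n")
csv_files = [good, os.path.join(tmp, "missing.csv")]

def read_or_none(path):
    """Return a DataFrame, or None if the file can't be read."""
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return None

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(read_or_none, csv_files))

# Drop the failures instead of letting one bad path abort everything
dataframes = [df for df in results if df is not None]
print(len(dataframes))  # 1: the missing file was skipped, not fatal
```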
Handling large CSV files with chunking
import pandas as pd
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk here
    chunks.append(chunk)
full_df = pd.concat(chunks, ignore_index=True)
print(f"Full dataframe shape: {full_df.shape}")

Output:
Full dataframe shape: (1000000, 5)
When a CSV file is too large to fit into memory, you can process it in smaller pieces. The chunksize parameter in pd.read_csv() lets you read the file incrementally. Instead of a single DataFrame, the function returns an iterator that yields data in manageable chunks.
- You can loop through this iterator to process each chunk individually before appending it to a list.
- Once all chunks are processed, pd.concat() combines them into a complete DataFrame.
- This approach lets you work with massive datasets without running out of memory.
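If all you need is an aggregate rather than the full DataFrame, you can go further and fold each chunk into a running total instead of storing it. A sketch, using a small generated file and a deliberately tiny chunk size to force several iterations:

```python
import os
import tempfile

import pandas as pd

# Generate a sample CSV with 100 rows so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "large_file.csv")
pd.DataFrame({"id": range(100), "value": [2] * 100}).to_csv(path, index=False)

# Fold each chunk into running totals; no chunk list is ever kept in memory
total = 0
row_count = 0
for chunk in pd.read_csv(path, chunksize=10):  # tiny chunks, for demonstration
    total += chunk["value"].sum()
    row_count += len(chunk)

print(row_count, total)  # 100 rows seen, value column sums to 200
```

Because each chunk is discarded after it is folded in, peak memory usage stays at roughly one chunk, regardless of the file's total size.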
Using dask for out-of-memory CSV processing
import dask.dataframe as dd
import glob
csv_files = glob.glob('data/*.csv')
dask_df = dd.read_csv(csv_files)
result = dask_df.describe().compute()
print(result)

Output:
          column1   column2   column3
count 1000000.000000 1000000.0 1000000.0
mean 50.000000 25.0 75.0
std 10.000000 5.0 15.0
min 0.000000 0.0 0.0
25% 25.000000 12.5 37.5
50% 50.000000 25.0 75.0
75% 75.000000 37.5 112.5
max 100.000000 50.0 150.0
For datasets that are too large for your computer's memory, Dask is an excellent solution. Unlike pandas, Dask's dd.read_csv() function doesn't load everything at once. It performs lazy evaluation, creating a Dask DataFrame that outlines the steps for processing your files without actually executing them.
- Operations like describe() are also lazy. They simply add another step to the computation plan.
- The compute() method is what finally triggers the execution, allowing Dask to intelligently read and process the data in manageable chunks.
This lets you run complex analyses on massive datasets without running out of memory.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the CSV processing techniques we've explored, Replit Agent can turn them into production tools:
- A sales data dashboard that automatically finds and merges daily sales CSVs using glob and pd.concat() to display real-time revenue trends.
- A log analysis utility that processes large log files in parallel from multiple sources, similar to using concurrent.futures, to identify and flag system errors.
- A market research tool that aggregates survey results from various CSV datasets, using Dask-like methods for out-of-memory analysis to generate a consolidated report.
Describe your app idea, and the agent writes the code, tests it, and fixes issues automatically, all in your browser. Try Replit Agent to turn your data project into a live application.
Common errors and challenges
Even with the right tools, you might run into a few common roadblocks when reading multiple CSV files in Python.
Handling files with inconsistent column names
One of the most frequent issues is dealing with files that have inconsistent column names or a different column order. When you try to combine these DataFrames using pd.concat(), pandas will create separate columns for each unique name, leading to a messy and incorrect final dataset.
- To fix this, you can standardize the column names before concatenation. A common approach is to read each file, convert all column headers to a consistent format like lowercase, and then combine them.
- Alternatively, if you only need a subset of columns, you can specify them with the usecols parameter when reading the files. This ensures you only load the data you need with consistent naming.
Dealing with FileNotFoundError when reading multiple files
A FileNotFoundError is another classic hurdle, typically popping up when your script can't locate a file at the specified path. This often happens when using a relative path or a glob pattern that doesn't match any files in the current working directory.
- Always double-check your file paths and patterns. Ensure the directory structure matches what's in your code.
- Using absolute paths or constructing them programmatically with a library like os can help make your script more robust and prevent these errors.
Preventing memory issues with selective column loading
Even when processing files in chunks, you can still run into memory problems if the individual CSVs are wide—meaning they have many columns. Loading all columns from a large file, even a portion of its rows, can quickly exhaust your available RAM.
- A simple and effective solution is to load only the columns you actually need for your analysis.
- The pd.read_csv() function includes a usecols parameter, which lets you provide a list of column names to import. This dramatically reduces memory consumption by telling pandas to ignore everything else.
Handling files with inconsistent column names
It’s a common problem: your CSV files have slightly different column names. When you try to merge them, pd.concat() can't align the data correctly. The code below demonstrates how this misalignment creates a messy DataFrame filled with unwanted NaN values.
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
# Inconsistent column names will misalign here, creating NaN-filled gaps
combined_df = pd.concat([pd.read_csv(file) for file in csv_files])
This code reads and combines files in a single step. Because pd.concat() matches data by column names, any inconsistencies force it to create separate columns, leading to gaps. The corrected approach is shown below.
import pandas as pd
import glob

csv_files = glob.glob('data/*.csv')
# Standardize column names before concatenating
combined_df = pd.concat(
    [pd.read_csv(file).rename(columns=str.lower) for file in csv_files],
    ignore_index=True
)
The corrected code standardizes the headers with rename(columns=str.lower) before concatenation, so columns that differ only in capitalization line up as a single column. (Note that join='outer' is already pd.concat()'s default, so passing it explicitly would change nothing.) If a file is genuinely missing a column, pandas still fills the gaps with NaN values, which ensures no data is lost during the merge, though you may need to handle the resulting empty cells in your final DataFrame.
Dealing with FileNotFoundError when reading multiple files
The FileNotFoundError is a common roadblock that will stop your script cold. It happens when a file path is wrong or a file is missing from your list. The code below demonstrates how one missing file crashes the entire process.
import pandas as pd
csv_files = ['file1.csv', 'file2.csv', 'missing_file.csv']
dataframes = [pd.read_csv(file) for file in csv_files] # Will crash on missing file
The list comprehension attempts to execute pd.read_csv() for every file in the list. The moment it encounters a path that doesn't exist, like 'missing_file.csv', the entire operation fails. The code below shows how to handle this gracefully.
import pandas as pd
import os
csv_files = ['file1.csv', 'file2.csv', 'missing_file.csv']
dataframes = []
for file in csv_files:
    if os.path.exists(file):
        dataframes.append(pd.read_csv(file))
    else:
        print(f"Warning: {file} not found, skipping")
The corrected code gracefully handles missing files by checking each path with os.path.exists() before trying to read it. Instead of crashing, the for loop prints a warning and skips any file that can't be found. This makes your script more robust, allowing it to continue processing the remaining files. It's a simple but effective way to prevent errors when your file list might be unreliable or generated automatically.
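An alternative, sketched below, is to wrap the read in try/except instead of checking paths up front: this also catches files that exist but cannot be parsed (for example, empty files raise pd.errors.EmptyDataError), and it avoids a race between the existence check and the read. The sample files are created inline for illustration:

```python
import os
import tempfile

import pandas as pd

# Sample files: one valid, one empty, one missing entirely
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.csv")
with open(good, "w") as f:
    f.write("id,value\n1,10\n")
empty = os.path.join(tmp, "empty.csv")
open(empty, "w").close()
csv_files = [good, empty, os.path.join(tmp, "missing.csv")]

dataframes = []
for file in csv_files:
    try:
        dataframes.append(pd.read_csv(file))
    except FileNotFoundError:
        print(f"Warning: {file} not found, skipping")
    except pd.errors.EmptyDataError:
        print(f"Warning: {file} is empty, skipping")

print(len(dataframes))  # 1: only the valid file was loaded
```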
Preventing memory issues with selective column loading
Reading every column from multiple large files is a surefire way to hit a memory limit. When you combine files using a list comprehension with pd.read_csv(), you're loading everything at once. The code below shows this common but risky approach.
import pandas as pd
import glob
# This might cause memory errors with many large files
csv_files = glob.glob('large_data/*.csv')
all_data = pd.concat([pd.read_csv(file) for file in csv_files])
This one-liner loads all data at once. The list comprehension creates a complete DataFrame for each file before pd.concat() combines them, which can exhaust your memory. The code below demonstrates a more strategic and memory-efficient approach.
import pandas as pd
import glob
csv_files = glob.glob('large_data/*.csv')
column_subset = ['id', 'value', 'category'] # Only read needed columns
all_data = pd.concat([
    pd.read_csv(file, usecols=column_subset, low_memory=True)
    for file in csv_files
])
The corrected approach is far more memory-efficient. By using the usecols parameter in pd.read_csv(), you instruct pandas to load only the columns you specify.
- This dramatically reduces memory usage, especially with "wide" datasets that have many columns.
- You simply define a list of required column names, like column_subset, and pass it to usecols.
- It's a critical technique when your analysis only requires a fraction of the available data.
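Column selection pairs well with dtype control: pd.read_csv() also accepts a dtype mapping, and reading a repetitive string column as pandas' category dtype stores each unique value only once. A sketch with generated data; the file and column names are illustrative:

```python
import os
import tempfile

import pandas as pd

# Generate a sample file with one column we never need
path = os.path.join(tempfile.mkdtemp(), "wide.csv")
pd.DataFrame({
    "id": range(1000),
    "value": [1.5] * 1000,
    "category": ["a", "b"] * 500,  # highly repetitive strings
    "unused": ["x"] * 1000,
}).to_csv(path, index=False)

# Load only the needed columns, and store the repetitive one as 'category'
df = pd.read_csv(
    path,
    usecols=["id", "value", "category"],
    dtype={"category": "category"},  # each unique string is stored once
)
print(df.shape)              # (1000, 3)
print(df["category"].dtype)  # category
```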
Real-world applications
Now that you can reliably load your data, you can tackle common analysis tasks like grouping with groupby() and merging with pd.merge().
Analyzing data across multiple CSV files with groupby()
After combining your CSVs, you can use the groupby() method to aggregate the data and perform calculations on specific categories.
import pandas as pd
import glob
sales_files = glob.glob('regional_sales/*.csv')
all_sales = pd.concat([pd.read_csv(file) for file in sales_files], ignore_index=True)
top_products = all_sales.groupby('product_name')['quantity'].sum().sort_values(ascending=False).head(5)
print(top_products)
This code snippet first combines all regional sales CSVs into a single DataFrame. It’s a common pattern for preparing data for analysis. The real power comes from the chained pandas methods used to find the best-selling products.
- The groupby('product_name') method organizes all sales records by product.
- Next, ['quantity'].sum() calculates the total quantity sold for each unique product.
- Finally, sort_values(ascending=False) and head(5) work together to rank the products and select the top five performers.
Merging and cleaning CSV data with pd.merge()
The pd.merge() function is perfect for combining different datasets, like customer records and transaction logs, especially after you’ve cleaned a shared column to ensure the data aligns correctly.
import pandas as pd
import glob
crm_files = glob.glob('crm_exports/*.csv')
crm_data = pd.concat([pd.read_csv(f) for f in crm_files], ignore_index=True)
transactions = pd.read_csv('accounting/transactions.csv')
crm_data['email'] = crm_data['email'].str.lower().str.strip()
transactions['customer_email'] = transactions['customer_email'].str.lower().str.strip()
customer_transactions = pd.merge(
    crm_data, transactions,
    left_on='email', right_on='customer_email', how='inner'
)
print(f"Matched transactions: {len(customer_transactions)}")
This script demonstrates a practical data integration workflow. It begins by consolidating multiple CRM export files into a single DataFrame and loading a separate file of transaction records. The key step is standardizing the data before combining it.
- It cleans the email columns in both datasets by converting text to lowercase and stripping whitespace. This ensures that variations like " [email protected] " and "[email protected]" are treated as identical for accurate matching.
- Finally, pd.merge() performs an inner join, creating a new dataset that includes only the customers who appear in both the CRM and transaction systems.
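When auditing a merge like this, pd.merge() also accepts indicator=True, which adds a _merge column recording whether each row matched; combined with how='outer', it lets you count unmatched records on either side. A small sketch with made-up sample data:

```python
import pandas as pd

# Made-up sample data: one CRM contact has no transactions
crm_data = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "name": ["Ann", "Ben", "Cam"],
})
transactions = pd.DataFrame({
    "customer_email": ["a@example.com", "b@example.com"],
    "amount": [100, 50],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
audit = pd.merge(
    crm_data, transactions,
    left_on="email", right_on="customer_email",
    how="outer", indicator=True,
)
matched = (audit["_merge"] == "both").sum()
crm_only = (audit["_merge"] == "left_only").sum()
print(matched, crm_only)  # 2 matched rows, 1 CRM-only contact
```

Counting the left_only and right_only rows before switching back to an inner join is a quick sanity check that your email cleaning actually lined the keys up.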
Get started with Replit
Turn these techniques into a real tool with Replit Agent. Try prompts like, “Build a dashboard that merges daily sales CSVs” or “Create a utility to find and combine all log files.”
The agent writes the code, tests for errors, and deploys your app right from your browser. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
