How to read multiple CSV files in Python
Learn how to read multiple CSV files in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

To analyze large datasets, you often need to read multiple CSV files in Python. This common data science task is simple with the right libraries and techniques for efficient data aggregation.
In this article, you'll explore various techniques for managing this process, along with practical tips, real-world applications, and debugging advice to help you select the best approach for your specific needs.
Using a basic loop to read multiple CSV files
import pandas as pd

csv_files = ['file1.csv', 'file2.csv', 'file3.csv']
dataframes = []
for file in csv_files:
    df = pd.read_csv(file)
    dataframes.append(df)

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
The most direct way to handle multiple files is to loop through them. This code iterates over a list of filenames, using the pandas read_csv() function to load each one into a DataFrame. It's a simple and effective method when you know exactly which files you need to process.
Each DataFrame is then added to a list called dataframes. This step creates a collection of separate datasets, preparing them for the next stage—typically combining them into a single, larger DataFrame for analysis.
Efficient file selection and combination
Instead of manually listing every file, you can automate file discovery and combine your datasets more efficiently with a few powerful techniques.
Using glob to find CSV files by pattern
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
dataframes = []
for file in csv_files:
    df = pd.read_csv(file)
    dataframes.append(df)
print(f"Found files: {csv_files}")
print(f"{len(dataframes)} dataframes have been read")

Output:
Found files: ['data/file1.csv', 'data/file2.csv', 'data/file3.csv']
3 dataframes have been read
The glob module offers a more dynamic way to find files. Instead of hardcoding filenames, you can use the glob.glob() function to locate all files matching a specific pattern. It's especially useful when your directory contains many CSV files.
- The argument 'data/*.csv' instructs Python to search within the data directory.
- The * wildcard matches any filename, as long as it ends with .csv.
This method automatically creates a list of file paths, making your code cleaner and more adaptable to changes in your dataset.
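If your CSV files live in nested subdirectories, glob can also search recursively with the ** wildcard, and pathlib offers an equivalent rglob() method. A minimal sketch; the directory tree here is generated on the fly purely for illustration:

```python
import glob
import os
import tempfile
from pathlib import Path

# Build a small directory tree with nested CSVs for demonstration
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2024", "q1"))
for rel in ["a.csv", os.path.join("2024", "b.csv"), os.path.join("2024", "q1", "c.csv")]:
    Path(root, rel).write_text("id,value\n1,10\n")

# The '**' pattern with recursive=True descends into subdirectories
found = sorted(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))
print(len(found))  # 3: files at every level are matched

# pathlib equivalent: rglob() always searches recursively
found_pathlib = sorted(str(p) for p in Path(root).rglob("*.csv"))
print(found_pathlib == found)  # True
```

By contrast, the non-recursive pattern 'data/*.csv' only matches files sitting directly inside the data directory.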
Using list comprehension for cleaner code
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
dataframes = [pd.read_csv(file) for file in csv_files]

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
List comprehension offers a more concise way to build your list of DataFrames. The line dataframes = [pd.read_csv(file) for file in csv_files] condenses the for loop and list creation into a single, readable statement. It’s a Pythonic approach that many developers prefer for its elegance.
- This method achieves the same result as the standard loop but with less code.
- The logic reads like plain English: create a list of DataFrames by reading each file in the csv_files list.
Combining multiple CSV files with pd.concat()
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
combined_df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)
print(f"Combined shape: {combined_df.shape}")

Output:
Combined shape: (300, 5)
Once you have a list of DataFrames, the final step is to merge them. The pd.concat() function handles this by taking your list of DataFrames and stacking them vertically into a single, unified dataset.
- This approach efficiently combines the file reading and concatenation into one line.
- The ignore_index=True argument is key. It creates a new, clean index for the final DataFrame, which prevents duplicate index values from the original files.
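A common extension of this pattern, sketched below, is tagging each row with the file it came from using DataFrame.assign() before concatenating. The tiny CSVs and the source_file column name are illustrative, not part of the original example:

```python
import os
import tempfile

import pandas as pd

# Create two tiny CSVs to stand in for real data files
tmp = tempfile.mkdtemp()
samples = {"jan.csv": "id,sales\n1,100\n2,200\n", "feb.csv": "id,sales\n3,300\n"}
for name, text in samples.items():
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

csv_files = sorted(os.path.join(tmp, name) for name in samples)

# assign() tags every row with its source before the DataFrames are stacked
combined = pd.concat(
    [pd.read_csv(f).assign(source_file=os.path.basename(f)) for f in csv_files],
    ignore_index=True,
)
print(combined.shape)                     # (3, 3): id, sales, source_file
print(combined["source_file"].nunique())  # 2 distinct source files
```

Keeping the source filename makes it much easier to trace bad rows back to the file that produced them.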
Advanced CSV processing techniques
While the methods above work well for many datasets, you'll need more powerful techniques to handle the performance and memory demands of large-scale CSV processing.
Processing CSV files in parallel with concurrent.futures
import pandas as pd
import glob
from concurrent.futures import ThreadPoolExecutor
csv_files = glob.glob('data/*.csv')
with ThreadPoolExecutor(max_workers=4) as executor:
    dataframes = list(executor.map(pd.read_csv, csv_files))

Output:
[DataFrame with shape (x, y), DataFrame with shape (x, y), DataFrame with shape (x, y)]
For large numbers of files, sequential reading can become a bottleneck. The concurrent.futures module helps by processing files in parallel. Using ThreadPoolExecutor creates a pool of worker threads, allowing you to read multiple CSVs simultaneously instead of one after another.
- The ThreadPoolExecutor is initialized with a set number of max_workers; in this case, four threads that can run tasks concurrently.
- The executor.map() function applies pd.read_csv to every file in your list, distributing the work across the available threads for a significant speed boost.
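One caveat: executor.map() re-raises the first exception it encounters, so a single unreadable file aborts the whole batch. A small wrapper can skip bad files instead. The sketch below generates its own sample files, and the read_or_none helper is a name made up for this example:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# One real file and one deliberately missing path
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.csv")
with open(good, "w") as f:
    f.write("id,value\n1,10\n")
csv_files = [good, os.path.join(tmp, "missing.csv")]

def read_or_none(path):
    """Return a DataFrame, or None if the file can't be read."""
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return None

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(read_or_none, csv_files))

# Drop the failures instead of letting one bad path abort everything
dataframes = [df for df in results if df is not None]
print(len(dataframes))  # 1: the missing file was skipped, not fatal
```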
Handling large CSV files with chunking
import pandas as pd
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk here
    chunks.append(chunk)
full_df = pd.concat(chunks, ignore_index=True)
print(f"Full dataframe shape: {full_df.shape}")

Output:
Full dataframe shape: (1000000, 5)
When a CSV file is too large to fit into memory, you can process it in smaller pieces. The chunksize parameter in pd.read_csv() lets you read the file incrementally. Instead of a single DataFrame, the function returns an iterator that yields data in manageable chunks.
- You can loop through this iterator to process each chunk individually before appending it to a list.
- Once all chunks are processed, pd.concat() combines them into a complete DataFrame.
- This approach lets you work with massive datasets without running out of memory.
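If all you need is an aggregate rather than the full DataFrame, you can go further and fold each chunk into a running total instead of storing it. A sketch, using a small generated file and a deliberately tiny chunk size to force several iterations:

```python
import os
import tempfile

import pandas as pd

# Generate a sample CSV with 100 rows so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "large_file.csv")
pd.DataFrame({"id": range(100), "value": [2] * 100}).to_csv(path, index=False)

# Fold each chunk into running totals; no chunk list is ever kept in memory
total = 0
row_count = 0
for chunk in pd.read_csv(path, chunksize=10):  # tiny chunks, for demonstration
    total += chunk["value"].sum()
    row_count += len(chunk)

print(row_count, total)  # 100 rows seen, value column sums to 200
```

Because each chunk is discarded after it is folded in, peak memory usage stays at roughly one chunk, regardless of the file's total size.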
Using dask for out-of-memory CSV processing
import dask.dataframe as dd
import glob
csv_files = glob.glob('data/*.csv')
dask_df = dd.read_csv(csv_files)
result = dask_df.describe().compute()
print(result)

Output:
          column1   column2   column3
count 1000000.000000 1000000.0 1000000.0
mean 50.000000 25.0 75.0
std 10.000000 5.0 15.0
min 0.000000 0.0 0.0
25% 25.000000 12.5 37.5
50% 50.000000 25.0 75.0
75% 75.000000 37.5 112.5
max 100.000000 50.0 150.0
For datasets that are too large for your computer's memory, Dask is an excellent solution. Unlike pandas, Dask's dd.read_csv() function doesn't load everything at once. It performs lazy evaluation, creating a Dask DataFrame that outlines the steps for processing your files without actually executing them.
- Operations like describe() are also lazy. They simply add another step to the computation plan.
- The compute() method is what finally triggers the execution, allowing Dask to intelligently read and process the data in manageable chunks.
This lets you run complex analyses on massive datasets without running out of memory.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the CSV processing techniques we've explored, Replit Agent can turn them into production tools:
- A sales data dashboard that automatically finds and merges daily sales CSVs using glob and pd.concat() to display real-time revenue trends.
- A log analysis utility that processes large log files in parallel from multiple sources, similar to using concurrent.futures, to identify and flag system errors.
- A market research tool that aggregates survey results from various CSV datasets, using Dask-like methods for out-of-memory analysis to generate a consolidated report.
Describe your app idea, and the agent writes the code, tests it, and fixes issues automatically, all in your browser. Try Replit Agent to turn your data project into a live application.
Common errors and challenges
Even with the right tools, you might run into a few common roadblocks when reading multiple CSV files in Python.
Handling files with inconsistent column names
One of the most frequent issues is dealing with files that have inconsistent column names or a different column order. When you try to combine these DataFrames using pd.concat(), pandas will create separate columns for each unique name, leading to a messy and incorrect final dataset.
- To fix this, you can standardize the column names before concatenation. A common approach is to read each file, convert all column headers to a consistent format like lowercase, and then combine them.
- Alternatively, if you only need a subset of columns, you can specify them with the usecols parameter when reading the files. This ensures you only load the data you need with consistent naming.
Dealing with FileNotFoundError when reading multiple files
A FileNotFoundError is another classic hurdle, typically popping up when your script can't locate a file at the specified path. This often happens when using a relative path or a glob pattern that doesn't match any files in the current working directory.
- Always double-check your file paths and patterns. Ensure the directory structure matches what's in your code.
- Using absolute paths or constructing them programmatically with a library like os can help make your script more robust and prevent these errors.
Preventing memory issues with selective column loading
Even when processing files in chunks, you can still run into memory problems if the individual CSVs are wide—meaning they have many columns. Loading all columns from a large file, even a portion of its rows, can quickly exhaust your available RAM.
- A simple and effective solution is to load only the columns you actually need for your analysis.
- The pd.read_csv() function includes a usecols parameter, which lets you provide a list of column names to import. This dramatically reduces memory consumption by telling pandas to ignore everything else.
Handling files with inconsistent column names
It’s a common problem: your CSV files have slightly different column names. When you try to merge them, pd.concat() can't align the data correctly. The code below demonstrates how this misalignment creates a messy DataFrame filled with unwanted NaN values.
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
# Inconsistent column names will misalign here, creating NaN-filled gaps
combined_df = pd.concat([pd.read_csv(file) for file in csv_files])
This code reads and combines files in a single step. Because pd.concat() matches data by column names, any inconsistencies force it to create separate columns, leading to gaps. The corrected approach is shown below.
import pandas as pd
import glob

csv_files = glob.glob('data/*.csv')
# Standardize column names before concatenating
combined_df = pd.concat(
    [pd.read_csv(file).rename(columns=str.lower) for file in csv_files],
    ignore_index=True
)
The corrected code standardizes the headers with rename(columns=str.lower) before concatenation, so columns that differ only in capitalization line up as a single column. (Note that join='outer' is already pd.concat()'s default, so passing it explicitly would change nothing.) If a file is genuinely missing a column, pandas still fills the gaps with NaN values, which ensures no data is lost during the merge, though you may need to handle the resulting empty cells in your final DataFrame.
Dealing with FileNotFoundError when reading multiple files
The FileNotFoundError is a common roadblock that will stop your script cold. It happens when a file path is wrong or a file is missing from your list. The code below demonstrates how one missing file crashes the entire process.
import pandas as pd
csv_files = ['file1.csv', 'file2.csv', 'missing_file.csv']
dataframes = [pd.read_csv(file) for file in csv_files] # Will crash on missing file
The list comprehension attempts to execute pd.read_csv() for every file in the list. The moment it encounters a path that doesn't exist, like 'missing_file.csv', the entire operation fails. The code below shows how to handle this gracefully.
import pandas as pd
import os
csv_files = ['file1.csv', 'file2.csv', 'missing_file.csv']
dataframes = []
for file in csv_files:
    if os.path.exists(file):
        dataframes.append(pd.read_csv(file))
    else:
        print(f"Warning: {file} not found, skipping")
The corrected code gracefully handles missing files by checking each path with os.path.exists() before trying to read it. Instead of crashing, the for loop prints a warning and skips any file that can't be found. This makes your script more robust, allowing it to continue processing the remaining files. It's a simple but effective way to prevent errors when your file list might be unreliable or generated automatically.
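An alternative, sketched below, is to wrap the read in try/except instead of checking paths up front: this also catches files that exist but cannot be parsed (for example, empty files raise pd.errors.EmptyDataError), and it avoids a race between the existence check and the read. The sample files are created inline for illustration:

```python
import os
import tempfile

import pandas as pd

# Sample files: one valid, one empty, one missing entirely
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.csv")
with open(good, "w") as f:
    f.write("id,value\n1,10\n")
empty = os.path.join(tmp, "empty.csv")
open(empty, "w").close()
csv_files = [good, empty, os.path.join(tmp, "missing.csv")]

dataframes = []
for file in csv_files:
    try:
        dataframes.append(pd.read_csv(file))
    except FileNotFoundError:
        print(f"Warning: {file} not found, skipping")
    except pd.errors.EmptyDataError:
        print(f"Warning: {file} is empty, skipping")

print(len(dataframes))  # 1: only the valid file was loaded
```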
Preventing memory issues with selective column loading
Reading every column from multiple large files is a surefire way to hit a memory limit. When you combine files using a list comprehension with pd.read_csv(), you're loading everything at once. The code below shows this common but risky approach.
import pandas as pd
import glob
# This might cause memory errors with many large files
csv_files = glob.glob('large_data/*.csv')
all_data = pd.concat([pd.read_csv(file) for file in csv_files])
This one-liner loads all data at once. The list comprehension creates a complete DataFrame for each file before pd.concat() combines them, which can exhaust your memory. The code below demonstrates a more strategic and memory-efficient approach.
import pandas as pd
import glob
csv_files = glob.glob('large_data/*.csv')
column_subset = ['id', 'value', 'category'] # Only read needed columns
all_data = pd.concat([
    pd.read_csv(file, usecols=column_subset, low_memory=True)
    for file in csv_files
])
The corrected approach is far more memory-efficient. By using the usecols parameter in pd.read_csv(), you instruct pandas to load only the columns you specify.
- This dramatically reduces memory usage, especially with "wide" datasets that have many columns.
- You simply define a list of required column names, like column_subset, and pass it to usecols.
- It's a critical technique when your analysis only requires a fraction of the available data.
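Column selection pairs well with dtype control: pd.read_csv() also accepts a dtype mapping, and reading a repetitive string column as pandas' category dtype stores each unique value only once. A sketch with generated data; the file and column names are illustrative:

```python
import os
import tempfile

import pandas as pd

# Generate a sample file with one column we never need
path = os.path.join(tempfile.mkdtemp(), "wide.csv")
pd.DataFrame({
    "id": range(1000),
    "value": [1.5] * 1000,
    "category": ["a", "b"] * 500,  # highly repetitive strings
    "unused": ["x"] * 1000,
}).to_csv(path, index=False)

# Load only the needed columns, and store the repetitive one as 'category'
df = pd.read_csv(
    path,
    usecols=["id", "value", "category"],
    dtype={"category": "category"},  # each unique string is stored once
)
print(df.shape)              # (1000, 3)
print(df["category"].dtype)  # category
```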
Real-world applications
Now that you can reliably load your data, you can tackle common analysis tasks like grouping with groupby() and merging with pd.merge().
Analyzing data across multiple CSV files with groupby()
After combining your CSVs, you can use the groupby() method to aggregate the data and perform calculations on specific categories.
import pandas as pd
import glob
sales_files = glob.glob('regional_sales/*.csv')
all_sales = pd.concat([pd.read_csv(file) for file in sales_files], ignore_index=True)
top_products = all_sales.groupby('product_name')['quantity'].sum().sort_values(ascending=False).head(5)
print(top_products)
This code snippet first combines all regional sales CSVs into a single DataFrame. It’s a common pattern for preparing data for analysis. The real power comes from the chained pandas methods used to find the best-selling products.
- The groupby('product_name') method organizes all sales records by product.
- Next, ['quantity'].sum() calculates the total quantity sold for each unique product.
- Finally, sort_values(ascending=False) and head(5) work together to rank the products and select the top five performers.
Merging and cleaning CSV data with pd.merge()
The pd.merge() function is perfect for combining different datasets, like customer records and transaction logs, especially after you’ve cleaned a shared column to ensure the data aligns correctly.
import pandas as pd
import glob
crm_files = glob.glob('crm_exports/*.csv')
crm_data = pd.concat([pd.read_csv(f) for f in crm_files], ignore_index=True)
transactions = pd.read_csv('accounting/transactions.csv')
crm_data['email'] = crm_data['email'].str.lower().str.strip()
transactions['customer_email'] = transactions['customer_email'].str.lower().str.strip()
customer_transactions = pd.merge(
    crm_data, transactions,
    left_on='email', right_on='customer_email', how='inner'
)
print(f"Matched transactions: {len(customer_transactions)}")
This script demonstrates a practical data integration workflow. It begins by consolidating multiple CRM export files into a single DataFrame and loading a separate file of transaction records. The key step is standardizing the data before combining it.
- It cleans the email columns in both datasets by converting text to lowercase and stripping whitespace. This ensures that variations like " [email protected] " and "[email protected]" are treated as identical for accurate matching.
- Finally, pd.merge() performs an inner join, creating a new dataset that includes only the customers who appear in both the CRM and transaction systems.
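When auditing a merge like this, pd.merge() also accepts indicator=True, which adds a _merge column recording whether each row matched; combined with how='outer', it lets you count unmatched records on either side. A small sketch with made-up sample data:

```python
import pandas as pd

# Made-up sample data: one CRM contact has no transactions
crm_data = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "name": ["Ann", "Ben", "Cam"],
})
transactions = pd.DataFrame({
    "customer_email": ["a@example.com", "b@example.com"],
    "amount": [100, 50],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
audit = pd.merge(
    crm_data, transactions,
    left_on="email", right_on="customer_email",
    how="outer", indicator=True,
)
matched = (audit["_merge"] == "both").sum()
crm_only = (audit["_merge"] == "left_only").sum()
print(matched, crm_only)  # 2 matched rows, 1 CRM-only contact
```

Counting the left_only and right_only rows before switching back to an inner join is a quick sanity check that your email cleaning actually lined the keys up.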
Get started with Replit
Turn these techniques into a real tool with Replit Agent. Try prompts like, “Build a dashboard that merges daily sales CSVs” or “Create a utility to find and combine all log files.”
The agent writes the code, tests for errors, and deploys your app right from your browser. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
