How to create a dataframe in Python

Learn how to create a Python DataFrame with our guide. We cover various methods, tips, real-world uses, and common error debugging.

Published on: Fri, Feb 6, 2026
Updated on: Tue, Feb 10, 2026
The Replit Team

To work with tabular data in Python, you'll almost always reach for a dataframe. This structure is fundamental for data analysis and offers a powerful way to organize and manipulate information for complex tasks.

You'll discover several techniques to create dataframes, with practical tips for implementation. You will also explore real-world applications and get advice to debug common errors you might encounter.

Basic dataframe creation with pandas

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
print(df)

--OUTPUT--
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

The most direct way to create a dataframe is by passing a Python dictionary to the pd.DataFrame() constructor. It's an efficient method for small, manually entered datasets. The dictionary's keys, such as 'Name' and 'Age', automatically become the column headers.

The corresponding values must be lists of equal length. These lists populate the rows for each column, ensuring your data is organized logically from the start. Each list represents a complete column of data points.

Common ways to create dataframes

Along with the dictionary-of-lists method, you can also create dataframes from other common structures like a list of dictionaries or a NumPy array.

Create from a dictionary of lists

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, 92, 78]}
df = pd.DataFrame(data)
print(df)

--OUTPUT--
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

This approach is one of the most common and readable ways to build a dataframe from scratch. You're essentially defining your dataset column by column within a standard Python dictionary before passing it to the constructor.

  • Each key, like 'Name' or 'Score', sets a column's title.
  • The associated list of values populates the rows for that column.

This method is particularly effective when your data is already organized by feature or category, as the dictionary maps directly to the final table structure.

Create from a list of dictionaries

import pandas as pd

data = [
    {'Name': 'Alice', 'Score': 85},
    {'Name': 'Bob', 'Score': 92},
    {'Name': 'Charlie', 'Score': 78}
]
df = pd.DataFrame(data)
print(df)

--OUTPUT--
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

This approach works well when your data is organized as a collection of records, which is common for data from APIs or JSON files. You're essentially building the dataframe row by row.

  • Each dictionary in the list corresponds to a single row.
  • The dictionary keys, like 'Name' and 'Score', become the column headers.

pandas automatically aligns the data, making this a very intuitive way to construct a dataframe from individual entries.
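This alignment also handles records with missing keys gracefully. In the following sketch (using a hypothetical incomplete record), pandas fills the gap with NaN instead of raising an error:

```python
import pandas as pd

# Hypothetical records where one entry is missing the 'Score' key;
# pandas aligns on keys and fills the gap with NaN
records = [
    {'Name': 'Alice', 'Score': 85},
    {'Name': 'Bob'},  # no 'Score' key in this record
]
df = pd.DataFrame(records)
print(df)
```

Note that the 'Score' column becomes a float column here, because NaN is a floating-point value.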

Create from NumPy arrays

import pandas as pd
import numpy as np

data = np.array([['Alice', 85], ['Bob', 92], ['Charlie', 78]])
df = pd.DataFrame(data, columns=['Name', 'Score'])
print(df)

--OUTPUT--
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

When your data comes from numerical computations, it's often stored in a NumPy array. You can create a dataframe directly from this structure, which is a seamless way to transition from libraries like NumPy or SciPy into pandas for more advanced data manipulation.

  • You pass the np.array as the first argument to pd.DataFrame().
  • Since arrays don't have inherent labels, you must explicitly name your columns using the columns parameter.
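One caveat: a NumPy array has a single dtype, so the mixed array above stores both columns as strings. A purely numeric array keeps its numeric dtype, as this sketch (with assumed scores) shows:

```python
import pandas as pd
import numpy as np

# An all-numeric array keeps its numeric dtype, so the resulting
# columns are ready for math without any conversion
scores = np.array([[85, 90], [92, 88], [78, 85]])
df = pd.DataFrame(scores, columns=['Math', 'Science'])
print(df.dtypes)  # both columns are integer dtype
```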

Advanced dataframe creation techniques

For more granular control over your data's organization, you can create dataframes with custom labels, multi-level indices, or directly from a pandas Series.

Create with custom index and column labels

import pandas as pd

data = [[85, 90], [92, 88], [78, 85]]
df = pd.DataFrame(data,
                  index=['Alice', 'Bob', 'Charlie'],
                  columns=['Math', 'Science'])
print(df)

--OUTPUT--
         Math  Science
Alice      85       90
Bob        92       88
Charlie    78       85

You're not limited to default numerical indices. For more descriptive dataframes, you can specify your own labels for both rows and columns during creation. This makes your data much easier to read and reference later on.

  • The index parameter lets you assign custom labels to each row, like 'Alice' or 'Bob'.
  • The columns parameter sets the headers for each column, such as 'Math' and 'Science'.

Create with multi-level indices

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)],
                                  names=['Letter', 'Number'])
df = pd.DataFrame({'Value': [0.1, 0.2, 0.3]}, index=index)
print(df)

--OUTPUT--
               Value
Letter Number
A      1         0.1
       2         0.2
B      1         0.3

For more complex datasets, you can create a hierarchical index—also known as a MultiIndex. This is useful when your data has multiple levels of grouping, allowing for more advanced slicing and analysis.

  • You build the index using a function like pd.MultiIndex.from_tuples(), which takes a list of tuples. Each tuple defines the index labels for a single row.
  • The names parameter lets you label the index levels themselves, such as 'Letter' and 'Number', making your dataframe easier to understand.
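Once built, the hierarchical index supports that slicing directly with .loc. A brief sketch, reusing the same index as above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)],
                                  names=['Letter', 'Number'])
df = pd.DataFrame({'Value': [0.1, 0.2, 0.3]}, index=index)

# Select every row under the outer label 'A'
print(df.loc['A'])

# Select one row by its full (Letter, Number) tuple
print(df.loc[('B', 1), 'Value'])
```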

Create from a Series object

import pandas as pd

s = pd.Series([85, 92, 78], index=['Alice', 'Bob', 'Charlie'], name='Score')
df = pd.DataFrame(s)
print(df)

--OUTPUT--
         Score
Alice       85
Bob         92
Charlie     78

A pandas Series can be directly converted into a single-column dataframe. This is a handy shortcut when your data is already structured as an indexed list, effectively representing one column of a future table.

  • The Series index, like 'Alice' and 'Bob', is used for the dataframe's row labels.
  • The name attribute of the Series—in this case, 'Score'—becomes the column header.

This creates a clean, one-column table from your existing Series object with minimal code.
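A handy alternative worth knowing: the Series method to_frame() performs the same conversion, so either form works:

```python
import pandas as pd

s = pd.Series([85, 92, 78], index=['Alice', 'Bob', 'Charlie'], name='Score')

# to_frame() is equivalent to pd.DataFrame(s) for a single Series
df = s.to_frame()
print(df)
```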

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. You can describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the dataframe creation techniques we've explored, Replit Agent can turn them into production-ready tools:

  • Build a data entry tool that takes user input and organizes it into a structured table from a dictionary of lists.
  • Create a dashboard that fetches JSON data from an API and displays it by converting a list of dictionaries into a clean dataframe.
  • Deploy a scientific data visualizer that processes numerical results from a NumPy array and presents them in a labeled table.

Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.

Common errors and challenges

Even with the right approach, you might run into a few common roadblocks when creating and manipulating dataframes.

A frequent hiccup is the TypeError that appears when you try to access a column with square brackets, like df['column']. This usually means your variable df isn't actually a dataframe. It might be a list or another object that doesn't support indexing with string keys. To debug this, check the variable's type with type(df) to confirm your dataframe was created successfully.
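A minimal sketch of that debugging step, using a hypothetical rows variable that was never converted to a dataframe:

```python
import pandas as pd

rows = [['Alice', 25], ['Bob', 30]]  # a plain list, not a dataframe

# rows['Name'] would raise TypeError: list indices must be integers
print(type(rows))  # <class 'list'> reveals the problem

# Converting first makes string-key access valid
df = pd.DataFrame(rows, columns=['Name', 'Age'])
print(type(df))
print(df['Name'])
```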

Sometimes a column's data type isn't what you expect. For instance, a column of numbers might be incorrectly read as strings, which prevents you from doing any math on it.

  • This often happens during data import if there are non-numeric characters mixed in.
  • You can fix this by explicitly converting the column using the astype() method, like df['Score'] = df['Score'].astype(float).
  • If the conversion fails, it's a sign you need to clean the data first to remove any problematic values.

If you merge two dataframes and see a lot of NaN values, it's likely due to mismatched keys. Pandas inserts NaN (Not a Number) when it can't find a matching value in the other dataframe for the key you're joining on. The fix is to ensure the key columns are identical in both dataframes—check for subtle differences like data types, extra spaces, or different capitalization before you merge.

Debugging type errors when accessing columns with []


While bracket notation ([]) is reliable, you might be tempted to use dot notation for quicker access. However, this breaks when column names contain spaces or special characters. The following code shows what happens.

import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})
# This will fail with a SyntaxError
first_names = df.First Name
print(first_names)

The space in 'First Name' breaks the dot notation. Python parses df.First and Name as two separate tokens, which is invalid syntax, so the code fails with a SyntaxError before it even runs. The code below shows the correct way to access the column.

import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})
# Use bracket notation for column names with spaces
first_names = df['First Name']
print(first_names)

Dot notation is a convenient shortcut, but it fails when column names contain spaces or special characters. df.First Name is invalid syntax, and even df.First on its own would raise an AttributeError, because the dataframe has no attribute named First.

To avoid this, always use bracket notation like df['First Name']. It's the most reliable way to access columns, as it handles any valid string as a column name, preventing unexpected errors and keeping your code robust.

Fixing column data type issues with astype()


It’s common for data imported from files to be read as strings, even when it looks numeric. This causes unexpected behavior when you perform calculations, as pandas will treat the operation as string manipulation instead of a mathematical one. The code below shows what happens when you try to multiply a column of string-based numbers—notice how it produces an unintended result.

import pandas as pd

data = {'ID': ['1', '2', '3'], 'Value': ['100', '200', '300']}
df = pd.DataFrame(data)
# This won't give the expected result because 'Value' is string type
result = df['Value'] * 2
print(result)

Since the 'Value' column contains strings, the * 2 operation repeats the text rather than performing a mathematical calculation. This is why you get an output like '100100' instead of 200. The following code shows how to fix this.

import pandas as pd

data = {'ID': ['1', '2', '3'], 'Value': ['100', '200', '300']}
df = pd.DataFrame(data)
# Convert 'Value' column to integer before multiplication
df['Value'] = df['Value'].astype(int)
result = df['Value'] * 2
print(result)

To fix this, you'll need to explicitly change the column's data type. The astype() method is your tool for the job.

  • By converting the column with df['Value'].astype(int), you ensure that operations like multiplication (* 2) perform correct mathematical calculations.

This is a crucial step after importing data, especially from files, where numbers are often misread as text. Checking your data types early can prevent many headaches.
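One way to check early is df.dtypes; and when a column contains stray non-numeric values, pd.to_numeric with errors='coerce' converts what it can and flags the rest as NaN. A sketch with a hypothetical 'n/a' entry:

```python
import pandas as pd

df = pd.DataFrame({'Value': ['100', '200', 'n/a']})
print(df.dtypes)  # 'Value' is object (strings), not numeric

# errors='coerce' turns unparseable entries into NaN instead of raising,
# flagging exactly which values still need cleaning
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
print(df)
```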

Resolving NaN values when merging with mismatched keys


When you merge two dataframes, you might find your new table filled with NaN (Not a Number) values. This usually happens when the key columns you're joining on don't match perfectly, leaving pandas unable to link the rows.

For example, even a simple difference in capitalization can cause the entire merge to fail. The following code demonstrates what happens when the keys in two dataframes don't align because of case sensitivity.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [4, 5, 6]})
# This will result in all NaN values due to case mismatch
merged = pd.merge(df1, df2, on='key')
print(merged)

The pd.merge() function is case-sensitive, so it can't match the uppercase keys in df1 with the lowercase keys in df2. The following code shows how to align the keys for a successful merge.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [4, 5, 6]})
# Convert keys to the same case before merging
df1['key'] = df1['key'].str.lower()
df2['key'] = df2['key'].str.lower()
merged = pd.merge(df1, df2, on='key', suffixes=('_1', '_2'))
print(merged)

To fix this, you need to standardize the key columns before the merge. The solution converts the keys in both dataframes to lowercase using .str.lower(), ensuring they match perfectly. This prevents NaN values from appearing where matches should exist.

  • The suffixes parameter handles columns with the same name in both dataframes, like 'value', by renaming them to 'value_1' and 'value_2'.
  • Always check your key columns for inconsistencies before merging.
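One way to diagnose a mismatch before committing to the fix is an outer join with indicator=True, which labels where each key was found. A sketch using hypothetical, partially overlapping keys:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'B', 'C'], 'value': [4, 5, 6]})

# An outer merge keeps unmatched rows, and indicator=True adds a
# '_merge' column showing whether each key came from one side or both
check = pd.merge(df1, df2, on='key', how='outer',
                 suffixes=('_1', '_2'), indicator=True)
print(check[['key', '_merge']])
```

Rows marked left_only or right_only point directly at the keys that need standardizing.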

Real-world applications

Beyond creating and fixing dataframes, you can now use them for practical tasks like analyzing sales or segmenting customer data.

Reading and analyzing sales data with groupby

To analyze sales performance by product, you can use the groupby() method to segment your data and calculate summary statistics like total sales amount.

import pandas as pd

sales_data = {'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
              'Amount': [100, 200, 150, 300, 250, 175]}
sales_df = pd.DataFrame(sales_data)
product_sales = sales_df.groupby('Product').sum()['Amount']
print(product_sales)

This code snippet aggregates sales data to find the total sales for each product. It's a common pattern for summarizing information in a table.

  • First, groupby('Product') organizes the dataframe, bundling all rows that share the same product ID.
  • Next, sum() is applied to each of these groups, calculating the total for the 'Amount' column.

The final selection ['Amount'] isolates these totals, giving you a concise summary of sales per product.

Merging datasets for customer segment analysis

By merging customer data with transaction records, you can perform a detailed analysis of spending habits for each customer segment.

This is done by combining two separate dataframes—one for customers and another for orders—using the pd.merge() function. The function joins the tables on a shared column, in this case 'CustomerID', linking each order to its corresponding customer details.

Once the data is merged, you can quickly summarize it:

  • The groupby('Segment') method organizes the combined data, so all orders from 'Premium' customers are bundled together, and the same for 'Standard' customers.
  • The agg() method then runs multiple calculations at once. Here, it computes both the total ('sum') and average ('mean') spending for the 'Amount' column within each segment.

import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Segment': ['Premium', 'Standard', 'Premium', 'Standard']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105],
    'CustomerID': [1, 3, 3, 2, 1],
    'Amount': [150, 200, 300, 50, 100]
})
merged_data = pd.merge(orders, customers, on='CustomerID')
segment_analysis = merged_data.groupby('Segment')['Amount'].agg(['sum', 'mean'])
print(segment_analysis)

Here, you're taking raw data from two sources and turning it into actionable business intelligence. The process creates a master table where every transaction is enriched with customer segment data. This sets the stage for a powerful aggregation.

  • By grouping this enriched data, you distill complex transaction lists into a simple summary.
  • The final output directly compares the spending habits of different customer segments, making it easy to see which groups are most valuable.

Get started with Replit

Put your knowledge into practice by building a tool. Describe what you want to build to Replit Agent, like “a utility that cleans CSV data” or “a dashboard that visualizes sales figures.”

The Agent writes the code, tests for errors, and deploys the app, turning your description into a finished product. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
