How to do data analysis in Python
Your guide to data analysis in Python. Learn different methods, tips, real-world applications, and how to debug common errors.

Python is a top choice for data analysis. It offers powerful libraries to manipulate, process, and visualize data. Its clear syntax makes complex data tasks more manageable for everyone.
In this article, you'll explore key techniques and practical tips for your projects. You will also see real-world applications and get advice to debug common issues, so you can analyze data with confidence.
Basic data analysis with pandas
import pandas as pd
data = pd.read_csv('sample_data.csv')
print(data.head())
--OUTPUT--
id age income education
0 1 25 50000 4
1 2 30 65000 5
2 3 35 75000 6
3 4 40 85000 6
4 5 45 100000 7
The analysis begins by loading the dataset. The code uses the pandas library's pd.read_csv() function to import a CSV file and convert it into a DataFrame. Think of a DataFrame as a smart spreadsheet—a two-dimensional table that makes handling data in Python much easier.
Next, data.head() previews the first five rows. This is a crucial sanity check. It lets you quickly verify that your data loaded correctly and see what your columns—like age and income—look like before you dive deeper into the analysis.
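A couple of other quick sanity checks pair well with head(). The sketch below assumes the same five rows as sample_data.csv, built inline here so it runs without the file:

```python
import pandas as pd

# Same five rows as sample_data.csv, constructed inline for illustration
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 65000, 75000, 85000, 100000],
    'education': [4, 5, 6, 6, 7],
})

print(data.shape)         # (rows, columns)
print(data.dtypes)        # confirm each column was parsed as a number
print(data.isna().sum())  # count missing values per column
```

Checking shape, dtypes, and missing-value counts up front catches most loading problems before they surface as confusing errors later.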
Exploratory data analysis techniques
With your data loaded, you can move beyond a simple preview to summarize its key characteristics and visualize the patterns hidden beneath the surface.
Descriptive statistics with describe()
import pandas as pd
data = pd.read_csv('sample_data.csv')
summary = data.describe()
print(summary['age'])
--OUTPUT--
count 5.000000
mean 35.000000
std 7.905694
min 25.000000
25% 30.000000
50% 35.000000
75% 40.000000
max 45.000000
Name: age, dtype: float64
The describe() method is a powerful shortcut for generating descriptive statistics. It automatically computes key metrics for all numerical columns in your DataFrame, giving you a high-level overview in one go. The code then isolates the summary specifically for the age column.
- The mean shows the average age is 35.
- The std (standard deviation) indicates how spread out the ages are from the average.
- Quartiles—25%, 50% (the median), and 75%—reveal the distribution of the data.
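If you only need a few of these statistics rather than the full describe() table, the agg() method lets you pick them by name. A small sketch using the same five ages, built inline:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45], name='age')

# Select just the statistics you care about instead of the full describe() table
stats = ages.agg(['mean', 'median', 'std', 'min', 'max'])
print(stats)
```

This is handy for reports where you want a compact, named set of metrics.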
Data visualization with matplotlib
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('sample_data.csv')
plt.figure(figsize=(8, 5))
plt.bar(data['age'], data['income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
--OUTPUT--
[Bar chart showing income increasing with age]
Numbers don't always tell the full story. Visualizing data with a library like matplotlib helps you spot trends and relationships instantly. This code creates a bar chart to see how income changes with age, turning raw data into an intuitive visual.
- The plt.bar() function is the core of the operation, mapping the age and income columns to the chart's axes.
- Functions like plt.xlabel() and plt.ylabel() add essential labels, making your plot readable.
- Finally, plt.show() renders the complete visual.
Advanced visualization with seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('sample_data.csv')
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
--OUTPUT--
[Heatmap showing correlations between all numeric variables]
While matplotlib is great for basic plots, seaborn helps you create more sophisticated visualizations with less code. This example generates a heatmap to quickly identify relationships between all numerical variables in your dataset.
- First, data.corr() computes a correlation matrix, which measures how strongly variables are related to one another.
- The sns.heatmap() function then visualizes this matrix as a color-coded grid.
- Setting annot=True is a key step—it displays the actual correlation values on the plot, making your analysis more precise.
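One caveat worth knowing: the sample dataset here is entirely numeric, but in pandas 2.x, data.corr() raises an error if the DataFrame contains text columns. Passing numeric_only=True restricts the calculation to numeric columns. A sketch with a hypothetical mixed-type frame:

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric ones
df = pd.DataFrame({
    'region': ['North', 'South', 'East'],
    'age': [25, 30, 35],
    'income': [50000, 65000, 75000],
})

# Restrict the correlation matrix to numeric columns
corr = df.corr(numeric_only=True)
print(corr)
```

Alternatively, select the numeric columns yourself with df.select_dtypes(include='number') before calling corr().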
Advanced data analysis methods
Moving beyond exploration, you can now perform rigorous statistical tests with scipy, analyze time-based patterns with pandas, and build predictive models with scikit-learn.
Statistical analysis with scipy
import pandas as pd
from scipy import stats
data = pd.read_csv('sample_data.csv')
correlation, p_value = stats.pearsonr(data['age'], data['income'])
print(f"Correlation: {correlation:.3f}, P-value: {p_value:.4f}")
--OUTPUT--
Correlation: 0.997, P-value: 0.0002
While visualizations hint at relationships, statistical tests provide quantitative evidence. The code uses the scipy library's stats.pearsonr() function to quantify the linear relationship between the age and income columns. This function returns two values that tell you whether the connection is meaningful.
- The correlation coefficient (0.997) measures the strength and direction of the relationship. A value this close to 1 indicates a very strong positive correlation.
- The p_value (0.0002) tells you the statistical significance. A low p-value suggests the observed correlation isn't just a random fluke.
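Pearson's test assumes a roughly linear relationship. When that assumption is shaky, or your data is ordinal, Spearman's rank correlation (stats.spearmanr) is a common alternative. A sketch using the same two columns, built inline:

```python
from scipy import stats

age = [25, 30, 35, 40, 45]
income = [50000, 65000, 75000, 85000, 100000]

# Rank-based correlation: robust to monotonic but non-linear relationships
rho, p_value = stats.spearmanr(age, income)
print(f"Spearman rho: {rho:.3f}, P-value: {p_value:.4f}")
```

Because income rises with age at every step here, the ranks agree perfectly and rho comes out at exactly 1.0.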
Time series analysis with pandas
import pandas as pd
import numpy as np
dates = pd.date_range('20230101', periods=5)
ts_data = pd.Series(np.random.randn(5).cumsum(), index=dates)
print(ts_data)
print("\nMoving average:")
print(ts_data.rolling(window=3).mean())
--OUTPUT--
2023-01-01 0.496714
2023-01-02 1.003825
2023-01-03 0.306560
2023-01-04 0.386352
2023-01-05 1.015908
dtype: float64
Moving average:
2023-01-01 NaN
2023-01-02 NaN
2023-01-03 0.602366
2023-01-04 0.565579
2023-01-05 0.569607
dtype: float64
Pandas is also excellent for time series analysis, which involves data points indexed in time order. The code first creates a sample dataset by pairing dates from pd.date_range() with random data. This sets up a basic time series, which is just a Series with a date-based index.
- The key technique shown is a moving average, calculated with ts_data.rolling(window=3).mean().
- This function smooths out short-term fluctuations by averaging data points over a specified window—in this case, three periods.
- The initial NaN values appear because there isn't enough preceding data to calculate the average for the first two points.
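If those leading NaN values are a problem, rolling() accepts a min_periods argument that computes a partial average as soon as that many points are available. A sketch with a fixed series instead of random data, so the result is reproducible:

```python
import pandas as pd

dates = pd.date_range('20230101', periods=5)
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# min_periods=1 averages whatever points exist so far, so no leading NaNs
smoothed = ts.rolling(window=3, min_periods=1).mean()
print(smoothed)
```

The first entry is just the first value, the second is the mean of the first two, and from the third onward you get the full three-period average.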
Machine learning with scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv('sample_data.csv')
X = data[['age']]
y = data['income']
model = LinearRegression().fit(X, y)
print(f"Coefficient: {model.coef_[0]:.2f}, R²: {model.score(X, y):.3f}")
--OUTPUT--
Coefficient: 2400.00, R²: 0.993
With scikit-learn, you can move from analysis to prediction. This code builds a LinearRegression model to forecast income based on age. The fit() method is where the magic happens—it trains the model to understand the relationship between your feature (X) and target (y) variables.
- The Coefficient of 2400.00 means that for every one-year increase in age, income is predicted to rise by $2400.
- The R² score of 0.993 shows that the model explains 99.3% of the income data's variance, indicating a very strong fit.
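Once fitted, the model can estimate income for ages it hasn't seen via predict(). A sketch using the same five rows, built inline so it runs without the CSV; note the input must be two-dimensional, matching the shape used for training:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample rows, constructed inline for illustration
data = pd.DataFrame({'age': [25, 30, 35, 40, 45],
                     'income': [50000, 65000, 75000, 85000, 100000]})

model = LinearRegression().fit(data[['age']], data['income'])

# Predict income for a 50-year-old; the input is 2-D like the training X
predicted = model.predict(pd.DataFrame({'age': [50]}))
print(f"Predicted income at age 50: {predicted[0]:.0f}")
```

Passing a DataFrame with the same column name as the training data also avoids scikit-learn's feature-name warning.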
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the data analysis techniques we've explored, Replit Agent can turn them into production tools:
- An interactive dashboard that uses matplotlib and seaborn to visualize sales data and reveal correlations between marketing spend and revenue.
- A forecasting tool that applies scikit-learn's LinearRegression model to predict future housing prices based on historical trends.
- A real-time analytics utility that calculates and displays moving averages for website traffic, leveraging pandas time series functions.
Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.
Common errors and challenges
Even with powerful tools, you'll encounter common roadblocks; here's how to navigate missing values, indexing warnings, and data type errors.
Handling missing values with fillna() vs. dropna()
Real-world datasets are often messy and incomplete. When you encounter missing values (represented as NaN), you have two primary options for cleaning your DataFrame.
- The dropna() method is the most direct approach—it removes any rows or columns containing missing data. While simple, this can lead to significant data loss if missing values are widespread.
- A more nuanced alternative is fillna(), which replaces missing values with a substitute. You can fill with a static number like zero, or use a calculated value like the column's mean or median to preserve the overall distribution of your data.
Avoiding chained indexing with pandas DataFrame
You might see a SettingWithCopyWarning when you try to modify a DataFrame using two sets of brackets, like data['age'][0] = 26. This is called chained indexing, and pandas warns you because the operation can be ambiguous. It's unclear whether you're modifying the original DataFrame or a temporary copy, which means your change might not stick.
The correct way to select and modify data is with the .loc[] or .iloc[] accessors. Using data.loc[0, 'age'] = 26 performs the selection and assignment in a single, guaranteed step, ensuring your update is applied directly to the original DataFrame without any warnings.
Debugging data type issues with astype()
A common headache is when a column of numbers is accidentally read as text, which prevents you from performing mathematical calculations. You can diagnose this by checking your column types with data.dtypes. If a numerical column shows up as object, you've found the problem.
The fix is straightforward with the astype() method. You can convert a column to the proper data type—like an integer or float—using a command such as data['income'] = data['income'].astype(int). This simple conversion unlocks the column for any numerical analysis you need to perform.
Handling missing values with fillna() vs. dropna()
Missing values, or NaNs, are common in real-world data. While it's tempting to just remove them with dropna(), this approach can accidentally delete important information from your dataset. The following code demonstrates this potential pitfall in action.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Bug: Using dropna might remove too many rows
cleaned_data = data.dropna()
print(f"Original shape: {data.shape}")
print(f"Cleaned shape: {cleaned_data.shape}")
The dropna() function removes any row containing a missing value, which can discard too much data. The DataFrame shrinks from four rows to two, potentially skewing your analysis. The following code demonstrates a more targeted approach.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Better approach: Fill missing values appropriately
cleaned_data = data.fillna(data.mean())
print(f"Original shape: {data.shape}")
print(f"Cleaned shape: {cleaned_data.shape}")
Instead of deleting rows, the fillna() method offers a smarter fix. The code replaces missing values with the mean of their respective columns, calculated with data.mean(). This preserves your dataset's original shape, so you don't lose valuable information.
This strategy is great when you want to maintain the statistical properties of your data without sacrificing rows. It's especially useful when your dataset has many scattered missing values, where dropna() would be too aggressive.
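When a column contains outliers, filling with the median instead of the mean is often more robust, since a single extreme value can drag the mean well away from typical values. A sketch on the same small frame:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
})

# The median is less sensitive to outliers than the mean
cleaned = data.fillna(data.median())
print(cleaned)
```

Each NaN is replaced with its own column's median, so the frame keeps all four rows.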
Avoiding chained indexing with pandas DataFrame
Chained indexing occurs when you use multiple brackets to select and modify data, but it can behave unpredictably. Pandas may issue a SettingWithCopyWarning because the operation might be acting on a temporary copy, not the original DataFrame.
The following code demonstrates how this can cause an update to fail silently, leaving your data unchanged.
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'C', 'B'],
'value': [10, 20, 30, 40, 50]
})
# Bug: Chained indexing may not work as expected
df[df['category'] == 'A']['value'] = 0
print(df)
The first selection df[df['category'] == 'A'] returns a temporary copy. Your assignment then updates this copy—not the original df—which is why the change doesn't stick. The next example shows the correct way to do this.
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'C', 'B'],
'value': [10, 20, 30, 40, 50]
})
# Fixed: Use loc for setting values
df.loc[df['category'] == 'A', 'value'] = 0
print(df)
The fix is to use the .loc[] accessor, which combines selection and assignment into a single, guaranteed operation. This method directly modifies the original DataFrame, preventing the SettingWithCopyWarning and ensuring your changes stick.
The expression df.loc[df['category'] == 'A', 'value'] = 0 clearly tells pandas to find all rows where the category is 'A' and set their 'value' to 0. This approach is reliable and should be your go-to for conditional updates.
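The same .loc[] pattern extends to compound conditions: combine boolean masks with & or |, wrapping each condition in parentheses. A sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'],
    'value': [10, 20, 30, 40, 50],
})

# Parentheses are required around each condition when combining with &
df.loc[(df['category'] == 'A') & (df['value'] > 20), 'value'] = 0
print(df)
```

Only the row where both conditions hold (category 'A' and value above 20) is updated; the other 'A' row is untouched.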
Debugging data type issues with astype()
It's a classic data analysis snag: you try to run a calculation, but it fails because your numbers are stored as text. This prevents any math, like summing a column. The code below shows what happens when you try adding values that are actually strings.
import pandas as pd
df = pd.DataFrame({
'id': ['1', '2', '3', '4', '5'],
'amount': ['100', '200', '300', '400', '500']
})
# Bug: Trying to perform numeric operations on string data
total = df['amount'].sum()
print(f"Total amount: {total}")
The sum() method doesn't treat the values as numbers; it concatenates them as strings. This results in a long piece of text instead of a mathematical total. The following code shows how to prepare the data correctly.
import pandas as pd
df = pd.DataFrame({
'id': ['1', '2', '3', '4', '5'],
'amount': ['100', '200', '300', '400', '500']
})
# Fixed: Convert to appropriate data type first
df['amount'] = df['amount'].astype(int)
total = df['amount'].sum()
print(f"Total amount: {total}")
The fix is to explicitly convert the column to a numeric type before calculations. By using df['amount'].astype(int), you tell pandas to treat each value as an integer. Now, when you call sum(), it performs a mathematical addition instead of string concatenation, giving you the correct total. It's a common issue when importing data, so check your column types with .dtypes if calculations behave unexpectedly.
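One caveat: astype(int) raises an error if any value can't be parsed, such as 'N/A' or an empty string. For messy imports, pd.to_numeric with errors='coerce' converts unparseable values to NaN so you can handle them explicitly. A sketch with one hypothetical bad value:

```python
import pandas as pd

df = pd.DataFrame({'amount': ['100', '200', 'N/A', '400']})

# Unparseable strings become NaN instead of raising an error
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
total = df['amount'].sum()  # sum() skips NaN by default
print(f"Total amount: {total}")
```

After coercion you can decide whether to drop or fill the resulting NaNs using the techniques above.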
Real-world applications
With these techniques and debugging skills, you can use pandas to tackle challenges like cleaning customer data and analyzing regional sales.
Customer data cleaning and scoring with pandas
This example puts theory into practice, using fillna() to tidy up an incomplete customer list and then calculating a value_score to identify the most important accounts.
import pandas as pd
import numpy as np
# Sample customer data with missing values
customers = pd.DataFrame({
'customer_id': range(1, 6),
'age': [25, 38, np.nan, 45, 32],
'purchase_amount': [120, 350, 200, np.nan, 410],
'loyalty_years': [1.2, 5.5, 0.5, 8.2, 3.1]
})
# Clean missing values and create customer segments
customers_clean = customers.fillna(customers.mean())
customers_clean['value_score'] = (customers_clean['purchase_amount'] * 0.5 +
customers_clean['loyalty_years'] * 100)
print(customers_clean.sort_values('value_score', ascending=False))
This code shows a common workflow for preparing data and creating new insights. It begins with a DataFrame containing missing values, marked as np.nan. The fillna(customers.mean()) method is used to replace these gaps with the average value of their respective columns, which is a great way to keep your data intact.
- It then engineers a new feature, value_score, by applying a weighted formula to the purchase_amount and loyalty_years columns.
- Finally, sort_values() organizes the DataFrame to rank customers based on this new score.
Regional sales analysis and reporting with pandas
This example uses pandas to transform raw sales figures into a concise performance report, calculating key metrics and using idxmax() to automatically identify the top-performing region.
import pandas as pd
# Sample sales data by region
sales = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'Central'],
'units_sold': [125, 310, 205, 98, 175],
'revenue': [125000, 155000, 92250, 49000, 26250]
})
# Calculate KPIs and export to different formats
sales['avg_price'] = sales['revenue'] / sales['units_sold']
sales['contribution_pct'] = sales['revenue'] / sales['revenue'].sum() * 100
# Create performance summary
top_region = sales.loc[sales['revenue'].idxmax(), 'region']
summary = f"Top performing region: {top_region}"
print(sales.round(2))
print(f"\n{summary}")
This code shows how you can create new metrics from existing data. It calculates the avg_price and contribution_pct by applying simple arithmetic operations directly to entire columns. This vectorized approach is efficient and a core strength of pandas.
- The code finds the top-performing region by chaining two methods: idxmax() identifies the index of the maximum revenue.
- Then, .loc[] uses that index to look up the corresponding region name, which is a clean way to extract specific information.
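To rank several top regions instead of just one, nlargest() is a convenient shortcut that sorts and slices in a single call. A sketch on the same sales figures:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'Central'],
    'revenue': [125000, 155000, 92250, 49000, 26250],
})

# Top three regions by revenue, highest first
top3 = sales.nlargest(3, 'revenue')
print(top3)
```

This is both shorter and faster than sorting the whole frame when you only need the leaders.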
Get started with Replit
Turn these techniques into a real tool. Describe what you want, like “a dashboard that visualizes sales data with matplotlib” or “a simple calculator that predicts income based on age using linear regression.”
Give your idea to Replit Agent, which writes the code, tests for errors, and deploys the app for you. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.

