How to do data analysis in Python
Your guide to data analysis in Python. Learn different methods, tips, real-world applications, and how to debug common errors.

Python is a top choice for data analysis. It offers powerful libraries to manipulate, process, and visualize data. Its clear syntax makes complex data tasks more manageable for everyone.
In this article, you'll explore key techniques and practical tips for your projects. You will also see real-world applications and get advice to debug common issues, so you can analyze data with confidence.
Basic data analysis with pandas
import pandas as pd
data = pd.read_csv('sample_data.csv')
print(data.head())
--OUTPUT--
id age income education
0 1 25 50000 4
1 2 30 65000 5
2 3 35 75000 6
3 4 40 85000 6
4 5 45 100000 7
The analysis begins by loading the dataset. The code uses the pandas library's pd.read_csv() function to import a CSV file and convert it into a DataFrame. Think of a DataFrame as a smart spreadsheet—a two-dimensional table that makes handling data in Python much easier.
Next, data.head() previews the first five rows. This is a crucial sanity check. It lets you quickly verify that your data loaded correctly and see what your columns—like age and income—look like before you dive deeper into the analysis.
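A couple of other quick sanity checks pair well with head(). The sketch below assumes the same five rows as sample_data.csv, built inline here so it runs without the file:

```python
import pandas as pd

# Same five rows as sample_data.csv, constructed inline for illustration
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 65000, 75000, 85000, 100000],
    'education': [4, 5, 6, 6, 7],
})

print(data.shape)         # (rows, columns)
print(data.dtypes)        # confirm each column was parsed as a number
print(data.isna().sum())  # count missing values per column
```

Checking shape, dtypes, and missing-value counts up front catches most loading problems before they surface as confusing errors later.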
Exploratory data analysis techniques
With your data loaded, you can move beyond a simple preview to summarize its key characteristics and visualize the patterns hidden beneath the surface.
Descriptive statistics with describe()
import pandas as pd
data = pd.read_csv('sample_data.csv')
summary = data.describe()
print(summary['age'])
--OUTPUT--
count 5.000000
mean 35.000000
std 7.905694
min 25.000000
25% 30.000000
50% 35.000000
75% 40.000000
max 45.000000
Name: age, dtype: float64
The describe() method is a powerful shortcut for generating descriptive statistics. It automatically computes key metrics for all numerical columns in your DataFrame, giving you a high-level overview in one go. The code then isolates the summary specifically for the age column.
- The mean shows the average age is 35.
- The std (standard deviation) indicates how spread out the ages are from the average.
- Quartiles—25%, 50% (the median), and 75%—reveal the distribution of the data.
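If you only need a few of these statistics rather than the full describe() table, the agg() method lets you pick them by name. A small sketch using the same five ages, built inline:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45], name='age')

# Select just the statistics you care about instead of the full describe() table
stats = ages.agg(['mean', 'median', 'std', 'min', 'max'])
print(stats)
```

This is handy for reports where you want a compact, named set of metrics.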
Data visualization with matplotlib
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('sample_data.csv')
plt.figure(figsize=(8, 5))
plt.bar(data['age'], data['income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
--OUTPUT--
[Bar chart showing income increasing with age]
Numbers don't always tell the full story. Visualizing data with a library like matplotlib helps you spot trends and relationships instantly. This code creates a bar chart to see how income changes with age, turning raw data into an intuitive visual.
- The plt.bar() function is the core of the operation, mapping the age and income columns to the chart's axes.
- Functions like plt.xlabel() and plt.ylabel() add essential labels, making your plot readable.
- Finally, plt.show() renders the complete visual.
Advanced visualization with seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('sample_data.csv')
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
--OUTPUT--
[Heatmap showing correlations between all numeric variables]
While matplotlib is great for basic plots, seaborn helps you create more sophisticated visualizations with less code. This example generates a heatmap to quickly identify relationships between all numerical variables in your dataset.
- First, data.corr() computes a correlation matrix, which measures how strongly variables are related to one another.
- The sns.heatmap() function then visualizes this matrix as a color-coded grid.
- Setting annot=True is a key step—it displays the actual correlation values on the plot, making your analysis more precise.
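One caveat worth knowing: the sample dataset here is entirely numeric, but in pandas 2.x, data.corr() raises an error if the DataFrame contains text columns. Passing numeric_only=True restricts the calculation to numeric columns. A sketch with a hypothetical mixed-type frame:

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric ones
df = pd.DataFrame({
    'region': ['North', 'South', 'East'],
    'age': [25, 30, 35],
    'income': [50000, 65000, 75000],
})

# Restrict the correlation matrix to numeric columns
corr = df.corr(numeric_only=True)
print(corr)
```

Alternatively, select the numeric columns yourself with df.select_dtypes(include='number') before calling corr().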
Advanced data analysis methods
Moving beyond exploration, you can now perform rigorous statistical tests with scipy, analyze time-based patterns with pandas, and build predictive models with scikit-learn.
Statistical analysis with scipy
import pandas as pd
from scipy import stats
data = pd.read_csv('sample_data.csv')
correlation, p_value = stats.pearsonr(data['age'], data['income'])
print(f"Correlation: {correlation:.3f}, P-value: {p_value:.4f}")
--OUTPUT--
Correlation: 0.997, P-value: 0.0002
While visualizations hint at relationships, statistical tests provide quantitative evidence. The code uses the scipy library's stats.pearsonr() function to quantify the linear relationship between the age and income columns. This function returns two values that tell you whether the connection is meaningful.
- The correlation coefficient (0.997) measures the strength and direction of the relationship. A value this close to 1 indicates a very strong positive correlation.
- The p_value (0.0002) tells you the statistical significance. A low p-value suggests the observed correlation isn't just a random fluke.
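Pearson's test assumes a roughly linear relationship. When that assumption is shaky, or your data is ordinal, Spearman's rank correlation (stats.spearmanr) is a common alternative. A sketch using the same two columns, built inline:

```python
from scipy import stats

age = [25, 30, 35, 40, 45]
income = [50000, 65000, 75000, 85000, 100000]

# Rank-based correlation: robust to monotonic but non-linear relationships
rho, p_value = stats.spearmanr(age, income)
print(f"Spearman rho: {rho:.3f}, P-value: {p_value:.4f}")
```

Because income rises with age at every step here, the ranks agree perfectly and rho comes out at exactly 1.0.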
Time series analysis with pandas
import pandas as pd
import numpy as np
dates = pd.date_range('20230101', periods=5)
ts_data = pd.Series(np.random.randn(5).cumsum(), index=dates)
print(ts_data)
print("\nMoving average:")
print(ts_data.rolling(window=3).mean())
--OUTPUT--
2023-01-01 0.496714
2023-01-02 1.003825
2023-01-03 0.306560
2023-01-04 0.386352
2023-01-05 1.015908
dtype: float64
Moving average:
2023-01-01 NaN
2023-01-02 NaN
2023-01-03 0.602366
2023-01-04 0.565579
2023-01-05 0.569607
dtype: float64
Pandas is also excellent for time series analysis, which involves data points indexed in time order. The code first creates a sample dataset by pairing dates from pd.date_range() with random data. This sets up a basic time series, which is just a Series with a date-based index.
- The key technique shown is a moving average, calculated with ts_data.rolling(window=3).mean().
- This function smooths out short-term fluctuations by averaging data points over a specified window—in this case, three periods.
- The initial NaN values appear because there isn't enough preceding data to calculate the average for the first two points.
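If those leading NaN values are a problem, rolling() accepts a min_periods argument that computes a partial average as soon as that many points are available. A sketch with a fixed series instead of random data, so the result is reproducible:

```python
import pandas as pd

dates = pd.date_range('20230101', periods=5)
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# min_periods=1 averages whatever points exist so far, so no leading NaNs
smoothed = ts.rolling(window=3, min_periods=1).mean()
print(smoothed)
```

The first entry is just the first value, the second is the mean of the first two, and from the third onward you get the full three-period average.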
Machine learning with scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv('sample_data.csv')
X = data[['age']]
y = data['income']
model = LinearRegression().fit(X, y)
print(f"Coefficient: {model.coef_[0]:.2f}, R²: {model.score(X, y):.3f}")
--OUTPUT--
Coefficient: 2400.00, R²: 0.993
With scikit-learn, you can move from analysis to prediction. This code builds a LinearRegression model to forecast income based on age. The fit() method is where the magic happens—it trains the model to understand the relationship between your feature (X) and target (y) variables.
- The Coefficient of 2400.00 means that for every one-year increase in age, income is predicted to rise by $2400.
- The R² score of 0.993 shows that the model explains 99.3% of the income data's variance, indicating a very strong fit.
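Once fitted, the model can estimate income for ages it hasn't seen via predict(). A sketch using the same five rows, built inline so it runs without the CSV; note the input must be two-dimensional, matching the shape used for training:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample rows, constructed inline for illustration
data = pd.DataFrame({'age': [25, 30, 35, 40, 45],
                     'income': [50000, 65000, 75000, 85000, 100000]})

model = LinearRegression().fit(data[['age']], data['income'])

# Predict income for a 50-year-old; the input is 2-D like the training X
predicted = model.predict(pd.DataFrame({'age': [50]}))
print(f"Predicted income at age 50: {predicted[0]:.0f}")
```

Passing a DataFrame with the same column name as the training data also avoids scikit-learn's feature-name warning.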
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the data analysis techniques we've explored, Replit Agent can turn them into production tools:
- An interactive dashboard that uses matplotlib and seaborn to visualize sales data and reveal correlations between marketing spend and revenue.
- A forecasting tool that applies scikit-learn's LinearRegression model to predict future housing prices based on historical trends.
- A real-time analytics utility that calculates and displays moving averages for website traffic, leveraging pandas time series functions.
Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.
Common errors and challenges
Even with powerful tools, you'll encounter common roadblocks; here's how to navigate missing values, indexing warnings, and data type errors.
Handling missing values with fillna() vs. dropna()
Real-world datasets are often messy and incomplete. When you encounter missing values (represented as NaN), you have two primary options for cleaning your DataFrame.
- The dropna() method is the most direct approach—it removes any rows or columns containing missing data. While simple, this can lead to significant data loss if missing values are widespread.
- A more nuanced alternative is fillna(), which replaces missing values with a substitute. You can fill with a static number like zero, or use a calculated value like the column's mean or median to preserve the overall distribution of your data.
Avoiding chained indexing with pandas DataFrame
You might see a SettingWithCopyWarning when you try to modify a DataFrame using two sets of brackets, like data['age'][0] = 26. This is called chained indexing, and pandas warns you because the operation can be ambiguous. It's unclear whether you're modifying the original DataFrame or a temporary copy, which means your change might not stick.
The correct way to select and modify data is with the .loc[] or .iloc[] accessors. Using data.loc[0, 'age'] = 26 performs the selection and assignment in a single, guaranteed step, ensuring your update is applied directly to the original DataFrame without any warnings.
Debugging data type issues with astype()
A common headache is when a column of numbers is accidentally read as text, which prevents you from performing mathematical calculations. You can diagnose this by checking your column types with data.dtypes. If a numerical column shows up as object, you've found the problem.
The fix is straightforward with the astype() method. You can convert a column to the proper data type—like an integer or float—using a command such as data['income'] = data['income'].astype(int). This simple conversion unlocks the column for any numerical analysis you need to perform.
Handling missing values with fillna() vs. dropna()
Missing values, or NaNs, are common in real-world data. While it's tempting to just remove them with dropna(), this approach can accidentally delete important information from your dataset. The following code demonstrates this potential pitfall in action.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Bug: Using dropna might remove too many rows
cleaned_data = data.dropna()
print(f"Original shape: {data.shape}")
print(f"Cleaned shape: {cleaned_data.shape}")
The dropna() function removes any row containing a missing value, which can discard too much data. The DataFrame shrinks from four rows to two, potentially skewing your analysis. The following code demonstrates a more targeted approach.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Better approach: Fill missing values appropriately
cleaned_data = data.fillna(data.mean())
print(f"Original shape: {data.shape}")
print(f"Cleaned shape: {cleaned_data.shape}")
Instead of deleting rows, the fillna() method offers a smarter fix. The code replaces missing values with the mean of their respective columns, calculated with data.mean(). This preserves your dataset's original shape, so you don't lose valuable information.
This strategy is great when you want to maintain the statistical properties of your data without sacrificing rows. It's especially useful when your dataset has many scattered missing values, where dropna() would be too aggressive.
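When a column contains outliers, filling with the median instead of the mean is often more robust, since a single extreme value can drag the mean well away from typical values. A sketch on the same small frame:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
})

# The median is less sensitive to outliers than the mean
cleaned = data.fillna(data.median())
print(cleaned)
```

Each NaN is replaced with its own column's median, so the frame keeps all four rows.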
Avoiding chained indexing with pandas DataFrame
Chained indexing occurs when you use multiple brackets to select and modify data, but it can behave unpredictably. Pandas may issue a SettingWithCopyWarning because the operation might be acting on a temporary copy, not the original DataFrame.
The following code demonstrates how this can cause an update to fail silently, leaving your data unchanged.
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'C', 'B'],
'value': [10, 20, 30, 40, 50]
})
# Bug: Chained indexing may not work as expected
df[df['category'] == 'A']['value'] = 0
print(df)
The first selection df[df['category'] == 'A'] returns a temporary copy. Your assignment then updates this copy—not the original df—which is why the change doesn't stick. The next example shows the correct way to do this.
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'C', 'B'],
'value': [10, 20, 30, 40, 50]
})
# Fixed: Use loc for setting values
df.loc[df['category'] == 'A', 'value'] = 0
print(df)
The fix is to use the .loc[] accessor, which combines selection and assignment into a single, guaranteed operation. This method directly modifies the original DataFrame, preventing the SettingWithCopyWarning and ensuring your changes stick.
The expression df.loc[df['category'] == 'A', 'value'] = 0 clearly tells pandas to find all rows where the category is 'A' and set their 'value' to 0. This approach is reliable and should be your go-to for conditional updates.
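The same .loc[] pattern extends to compound conditions: combine boolean masks with & or |, wrapping each condition in parentheses. A sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'],
    'value': [10, 20, 30, 40, 50],
})

# Parentheses are required around each condition when combining with &
df.loc[(df['category'] == 'A') & (df['value'] > 20), 'value'] = 0
print(df)
```

Only the row where both conditions hold (category 'A' and value above 20) is updated; the other 'A' row is untouched.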
Debugging data type issues with astype()
It's a classic data analysis snag: you try to run a calculation, but it fails because your numbers are stored as text. This prevents any math, like summing a column. The code below shows what happens when you try adding values that are actually strings.
import pandas as pd
df = pd.DataFrame({
'id': ['1', '2', '3', '4', '5'],
'amount': ['100', '200', '300', '400', '500']
})
# Bug: Trying to perform numeric operations on string data
total = df['amount'].sum()
print(f"Total amount: {total}")
The sum() method doesn't treat the values as numbers; it concatenates them as strings. This results in a long piece of text instead of a mathematical total. The following code shows how to prepare the data correctly.
import pandas as pd
df = pd.DataFrame({
'id': ['1', '2', '3', '4', '5'],
'amount': ['100', '200', '300', '400', '500']
})
# Fixed: Convert to appropriate data type first
df['amount'] = df['amount'].astype(int)
total = df['amount'].sum()
print(f"Total amount: {total}")
The fix is to explicitly convert the column to a numeric type before calculations. By using df['amount'].astype(int), you tell pandas to treat each value as an integer. Now, when you call sum(), it performs a mathematical addition instead of string concatenation, giving you the correct total. It's a common issue when importing data, so check your column types with .dtypes if calculations behave unexpectedly.
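One caveat: astype(int) raises an error if any value can't be parsed, such as 'N/A' or an empty string. For messy imports, pd.to_numeric with errors='coerce' converts unparseable values to NaN so you can handle them explicitly. A sketch with one hypothetical bad value:

```python
import pandas as pd

df = pd.DataFrame({'amount': ['100', '200', 'N/A', '400']})

# Unparseable strings become NaN instead of raising an error
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
total = df['amount'].sum()  # sum() skips NaN by default
print(f"Total amount: {total}")
```

After coercion you can decide whether to drop or fill the resulting NaNs using the techniques above.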
Real-world applications
With these techniques and debugging skills, you can use pandas to tackle challenges like cleaning customer data and analyzing regional sales.
Customer data cleaning and scoring with pandas
This example puts theory into practice, using fillna() to tidy up an incomplete customer list and then calculating a value_score to identify the most important accounts.
import pandas as pd
import numpy as np
# Sample customer data with missing values
customers = pd.DataFrame({
'customer_id': range(1, 6),
'age': [25, 38, np.nan, 45, 32],
'purchase_amount': [120, 350, 200, np.nan, 410],
'loyalty_years': [1.2, 5.5, 0.5, 8.2, 3.1]
})
# Clean missing values and create customer segments
customers_clean = customers.fillna(customers.mean())
customers_clean['value_score'] = (customers_clean['purchase_amount'] * 0.5 +
customers_clean['loyalty_years'] * 100)
print(customers_clean.sort_values('value_score', ascending=False))
This code shows a common workflow for preparing data and creating new insights. It begins with a DataFrame containing missing values, marked as np.nan. The fillna(customers.mean()) method is used to replace these gaps with the average value of their respective columns, which is a great way to keep your data intact.
- It then engineers a new feature, value_score, by applying a weighted formula to the purchase_amount and loyalty_years columns.
- Finally, sort_values() organizes the DataFrame to rank customers based on this new score.
Regional sales analysis and reporting with pandas
This example uses pandas to transform raw sales figures into a concise performance report, calculating key metrics and using idxmax() to automatically identify the top-performing region.
import pandas as pd
# Sample sales data by region
sales = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'Central'],
'units_sold': [125, 310, 205, 98, 175],
'revenue': [125000, 155000, 92250, 49000, 26250]
})
# Calculate KPIs and export to different formats
sales['avg_price'] = sales['revenue'] / sales['units_sold']
sales['contribution_pct'] = sales['revenue'] / sales['revenue'].sum() * 100
# Create performance summary
top_region = sales.loc[sales['revenue'].idxmax(), 'region']
summary = f"Top performing region: {top_region}"
print(sales.round(2))
print(f"\n{summary}")
This code shows how you can create new metrics from existing data. It calculates the avg_price and contribution_pct by applying simple arithmetic operations directly to entire columns. This vectorized approach is efficient and a core strength of pandas.
- The code finds the top-performing region by chaining two methods: idxmax() identifies the index of the maximum revenue.
- Then, .loc[] uses that index to look up the corresponding region name, which is a clean way to extract specific information.
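To rank several top regions instead of just one, nlargest() is a convenient shortcut that sorts and slices in a single call. A sketch on the same sales figures:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'Central'],
    'revenue': [125000, 155000, 92250, 49000, 26250],
})

# Top three regions by revenue, highest first
top3 = sales.nlargest(3, 'revenue')
print(top3)
```

This is both shorter and faster than sorting the whole frame when you only need the leaders.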
Get started with Replit
Turn these techniques into a real tool. Describe what you want, like “a dashboard that visualizes sales data with matplotlib” or “a simple calculator that predicts income based on age using linear regression.”
Give your idea to Replit Agent, which writes the code, tests for errors, and deploys the app for you. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.

