How to create synthetic data in Python
Learn how to create synthetic data in Python. Explore different methods, tips, real-world applications, and how to debug common errors.
Synthetic data is a powerful tool for machine learning and software testing. Python provides robust libraries that let you generate realistic, artificial datasets when real-world information is not an option.
In this article, you’ll explore key techniques and practical tips for data generation. You'll also find real-world applications and debugging advice to help you refine your approach for any project.
Creating simple random data with NumPy
import numpy as np
# Generate 5 random integers between 0 and 100
random_integers = np.random.randint(0, 100, size=5)
print(random_integers)
--OUTPUT--
[42 71 13 97 56]
NumPy is your go-to for numerical tasks in Python, and it’s perfect for generating simple random data. The function np.random.randint() creates an array of integers within a range you define. It’s a straightforward way to get a baseline dataset for testing.
Here, it generates five random integers from 0 up to, but not including, 100. The size parameter lets you easily control how much data you create. The resulting NumPy array is a fundamental building block for many machine learning and testing scenarios.
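NumPy can generate more than integers. The sketch below (the ranges and category names are illustrative) draws uniform floats and samples from a list of categories, two other common building blocks for test data:

```python
import numpy as np

# Uniform floats between 0.0 (inclusive) and 10.0 (exclusive)
random_floats = np.random.uniform(0.0, 10.0, size=5)

# Random picks from a list of categories
colors = np.random.choice(['red', 'green', 'blue'], size=5)

print(random_floats)
print(colors)
```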
Basic synthetic data generation techniques
While random integers are a good start, you can create more realistic datasets by modeling distributions, relationships, and time-dependent patterns.
Using normal distribution for realistic values
import numpy as np
# Generate 1000 samples from a normal distribution
# with mean=70 and standard deviation=5
heights = np.random.normal(70, 5, 1000)
print(f"Mean: {heights.mean():.2f}, Std: {heights.std():.2f}")
print(f"First 5 heights: {heights[:5]}")--OUTPUT--Mean: 70.11, Std: 4.95
First 5 heights: [69.17832308 77.11650218 70.79055741 68.43590158 71.66359423]
Real-world data often follows a normal distribution, where most values cluster around an average. NumPy's np.random.normal() function is perfect for simulating this. It creates more realistic datasets for things like human heights or test scores.
- The mean is set to 70, representing the central point of the data.
- A standard deviation of 5 determines how spread out the values are.
- The code generates 1000 individual data points.
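You can sanity-check a generated sample against the empirical 68-95-99.7 rule: in a normal distribution, about 68% of values fall within one standard deviation of the mean. A quick sketch using the same mean and standard deviation as above:

```python
import numpy as np

heights = np.random.normal(70, 5, 1000)

# Fraction of samples within one standard deviation of the mean
within_one_std = np.mean(np.abs(heights - 70) <= 5)
print(f"Within 1 std: {within_one_std:.1%}")  # roughly 68% for normal data
```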
Creating correlated features
import numpy as np
# Generate two correlated variables
n = 1000
x = np.random.normal(0, 1, n)
# y correlates with x at about 0.8, plus independent noise
# (0.8**2 + 0.6**2 = 1 keeps y's variance near 1)
y = 0.8 * x + 0.6 * np.random.normal(0, 1, n)
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation between x and y: {correlation:.4f}")--OUTPUT--Correlation between x and y: 0.8021
In real-world data, features are often related. This code creates two correlated variables, where the value of y depends on x. The formula y = 0.8 * x + ... establishes a strong positive relationship, while the second term adds a bit of random noise to keep it from being a perfect one-to-one match.
- This technique mimics natural relationships, like height and weight, where one influences the other.
- Finally, np.corrcoef() calculates the correlation, confirming how closely the two variables move together.
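If you need a specific correlation directly, np.random.multivariate_normal accepts a covariance matrix; a sketch targeting the same 0.8 correlation between two unit-variance variables:

```python
import numpy as np

n = 1000
# Covariance matrix for two unit-variance variables with correlation 0.8
cov = [[1.0, 0.8],
       [0.8, 1.0]]

samples = np.random.multivariate_normal(mean=[0, 0], cov=cov, size=n)
x, y = samples[:, 0], samples[:, 1]
print(f"Correlation: {np.corrcoef(x, y)[0, 1]:.4f}")
```

This approach scales cleanly to more than two variables: add a row and column to the covariance matrix for each new feature.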
Generating synthetic time series data
import numpy as np
import pandas as pd
# Create a time series with trend and seasonality
dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(0, 5, 100)
seasonality = 2 * np.sin(np.linspace(0, 12 * np.pi, 100))
noise = np.random.normal(0, 0.5, 100)
ts_data = trend + seasonality + noise
time_series = pd.Series(ts_data, index=dates)
print(time_series.head())
--OUTPUT--
2023-01-01    0.068244
2023-01-02 0.288208
2023-01-03 0.857800
2023-01-04 1.402486
2023-01-05 1.722618
Freq: D, dtype: float64
This code generates time series data by combining three key elements—a common way to simulate data like stock prices or daily sales figures. The final output is a Pandas Series, which pairs each data point with a specific date from a date_range.
- Trend: The np.linspace() function creates a steady, linear progression over time.
- Seasonality: A sine wave from np.sin() adds a repeating, cyclical pattern.
- Noise: Random values from np.random.normal() make the data less predictable and more realistic.
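With the series in hand, pandas can smooth the noise and seasonality back out to reveal the trend. The sketch below reuses the same construction; the 17-day window is an assumption chosen to roughly match one seasonal cycle (6 cycles over 100 points is about 17 points per cycle):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(0, 5, 100)
seasonality = 2 * np.sin(np.linspace(0, 12 * np.pi, 100))
noise = np.random.normal(0, 0.5, 100)
time_series = pd.Series(trend + seasonality + noise, index=dates)

# A centered rolling mean over ~one seasonal cycle averages the
# sine wave and noise toward zero, leaving the linear trend
smoothed = time_series.rolling(window=17, center=True).mean()
print(smoothed.dropna().head())
```

Picking a window close to the seasonal period is what makes the cycle cancel out; a much shorter window would leave the oscillation visible.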
Advanced synthetic data methods
While NumPy provides the building blocks, you can create more sophisticated datasets using scikit-learn’s generators, pandas DataFrames, and other specialized libraries.
Using scikit-learn's built-in dataset generators
from sklearn.datasets import make_classification
# Generate a synthetic classification dataset
X, y = make_classification(
n_samples=1000, n_features=4, n_informative=2,
n_redundant=0, random_state=42
)
print(f"Data shape: {X.shape}, Target shape: {y.shape}")
print(f"First sample: {X[0]}, Class: {y[0]}")--OUTPUT--Data shape: (1000, 4), Target shape: (1000,)
First sample: [ 1.3218364 -0.57534663 0.61667161 -1.99496865], Class: 1
Scikit-learn’s make_classification function is a powerful shortcut for creating datasets to test classification models. It gives you fine-grained control over the data’s structure, letting you simulate realistic scenarios where some features are more useful than others.
- The n_informative parameter is key: it specifies how many features actually influence the classification outcome.
- n_samples and n_features define the dataset's dimensions.
- Using random_state ensures you get the same "random" data every time, which is crucial for reproducible experiments.
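scikit-learn ships generators for other task types too; for example, make_blobs produces clustered points, which is handy for testing clustering algorithms. A minimal sketch:

```python
from sklearn.datasets import make_blobs

# Generate 300 2D points grouped into 3 clusters
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=1.0, random_state=42)
print(f"Data shape: {X.shape}, cluster labels: {sorted(set(y))}")
```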
Creating structured datasets with pandas
import pandas as pd
import numpy as np
# Generate a synthetic customer dataset
n = 5
df = pd.DataFrame({
'customer_id': range(1001, 1001 + n),
'age': np.random.randint(18, 80, n),
'income': np.random.normal(50000, 15000, n),
'is_active': np.random.choice([True, False], n, p=[0.8, 0.2])
})
print(df)
--OUTPUT--
customer_id age income is_active
0 1001 61 55127.8635 True
1 1002 32 63461.0136 True
2 1003 27 44667.7749 True
3 1004 32 50292.0200 False
4 1005 49 36598.9539 True
Pandas DataFrames are perfect for creating structured, table-like datasets that mix different data types. This code builds a customer table by assigning generated data to named columns, creating a realistic, multi-faceted dataset for testing.
- It combines familiar NumPy functions for columns like age and income.
- The is_active column uses np.random.choice with a p argument to create a weighted distribution, in this case making 80% of the customers active.
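You can also derive new columns from the generated ones. The sketch below buckets the synthetic incomes into segments with pd.cut; the bin edges and segment names are hypothetical:

```python
import numpy as np
import pandas as pd

n = 5
df = pd.DataFrame({
    'customer_id': range(1001, 1001 + n),
    'age': np.random.randint(18, 80, n),
    'income': np.random.normal(50000, 15000, n),
})

# Bucket incomes into labeled segments (illustrative thresholds)
df['segment'] = pd.cut(df['income'],
                       bins=[-np.inf, 40000, 70000, np.inf],
                       labels=['low', 'mid', 'high'])
print(df)
```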
Using specialized synthetic data libraries
# Using SDV (Synthetic Data Vault) for tabular data
# Note: this is the pre-1.0 SDV API; newer releases expose
# GaussianCopulaSynthesizer in sdv.single_table instead
import pandas as pd
from sdv.tabular import GaussianCopula
# Create and fit model on sample data
model = GaussianCopula()
data = pd.DataFrame({
'age': [25, 32, 45, 63, 58],
'income': [50000, 75000, 63000, 82000, 45000],
'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
})
model.fit(data)
# Generate synthetic samples
synthetic_data = model.sample(3)
print(synthetic_data)
--OUTPUT--
age income education
0 51 64173.4732 Bachelor
1 38 72865.3462 Master
2 61 53296.8744 Bachelor
For complex datasets, specialized libraries like SDV (Synthetic Data Vault) are a game-changer. Instead of just generating random values, they learn the statistical patterns from a sample dataset. This allows you to create new, artificial data that preserves the original's structure and correlations, including relationships between mixed data types.
- The code first fits a GaussianCopula model to a small sample DataFrame using the model.fit() method. This step learns the relationships between columns like age, income, and education.
- Once trained, you can call model.sample() to generate new, entirely synthetic rows that follow the learned patterns.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
The synthetic data techniques from this article can be turned into production tools. For example, Replit Agent can build:
- A sales forecasting dashboard that generates and visualizes time series data with customizable trends and seasonality.
- A user simulation tool that produces realistic customer datasets using weighted probabilities from np.random.choice.
- A machine learning testbed that generates classification datasets with make_classification to benchmark model performance.
Start building your own tools by describing them to Replit Agent. It writes the code, tests it, and fixes issues automatically, all in your browser.
Common errors and challenges
Generating synthetic data can be tricky, but most errors are simple to fix once you know what to look for.
- Fixing the seed for reproducible results: When you need your "random" data to be the same every time, you'll want to set a seed with np.random.seed(). Without it, your results will change on each execution, making it impossible to reproduce bugs or compare model performance accurately. Using the same seed number guarantees you'll get the same sequence of random values.
- Avoiding incorrect parameter order: A frequent slip-up with np.random.normal() is mixing up the mean (loc) and standard deviation (scale). Accidentally swapping them generates data with completely different statistical properties. To prevent this, use named arguments like np.random.normal(loc=70, scale=5) to make your code clearer.
- Handling shape issues: Shape mismatches are a common headache when using np.random.randint(). If you want an array but get a single integer, you likely forgot the size parameter. Always double-check that the output array's shape matches what the rest of your code expects to avoid hard-to-trace errors.
Fixing the seed for reproducible results with np.random.seed()
For reliable testing and debugging, your "random" data must be consistent across runs. Without a fixed starting point, or "seed," you can't reproduce results, making it hard to verify fixes. The code below shows what happens without setting a seed.
import numpy as np
# Generate random numbers that change each run
random_data = np.random.rand(3)
print("First run:", random_data)
random_data = np.random.rand(3)
print("Second run:", random_data)
Each call to np.random.rand(3) produces a new, unpredictable array because the generator's starting point isn't fixed. This makes consistent testing impossible. The following code shows how to get the same results every time.
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
random_data = np.random.rand(3)
print("First run:", random_data)
np.random.seed(42)
random_data = np.random.rand(3)
print("Second run:", random_data)
By calling np.random.seed(42) before generating data, you fix the starting point for NumPy's random number generator. This guarantees you get the same sequence of "random" numbers every time you run the code. It's crucial for:
- Debugging code that relies on random inputs.
- Creating reproducible machine learning experiments.
- Ensuring others can replicate your results exactly.
The number itself doesn't matter—as long as you use the same one, your results will be consistent.
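If you prefer not to touch global state, NumPy's newer Generator API scopes the seed to a single object, so seeded code in one place can't be disturbed by random calls elsewhere. A minimal sketch:

```python
import numpy as np

# Each Generator carries its own seed; no global state involved
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

first = rng1.random(3)
second = rng2.random(3)
print("Identical:", np.array_equal(first, second))  # True
```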
Avoiding incorrect parameter order in np.random.normal()
A common mistake with np.random.normal() is swapping the mean and standard deviation parameters. This simple error can silently corrupt your dataset, leading to skewed results. The code below shows what happens when the parameters are accidentally mixed up.
import numpy as np
# Incorrect parameter order (size, mean, std)
samples = np.random.normal(100, 70, 10)
print(f"Mean: {samples.mean():.2f}, Std: {samples.std():.2f}")
Since np.random.normal() reads arguments positionally, it sets the mean to 100 and the standard deviation to 70. This creates an extremely wide data distribution, which isn't the goal. Check the code below for a more robust approach.
import numpy as np
# Correct parameter order (mean, std, size)
samples = np.random.normal(70, 10, 100)
print(f"Mean: {samples.mean():.2f}, Std: {samples.std():.2f}")
By providing the arguments in the correct order—mean, standard deviation, and size—you get the intended result. The function np.random.normal(70, 10, 100) correctly generates 100 samples centered around a mean of 70 with a standard deviation of 10. Always double-check a function's documentation when you're unsure about positional arguments, as it's an easy mistake to make when generating statistical data. This ensures your dataset's properties are what you expect.
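Named arguments make the intent explicit and immune to ordering mistakes; a sketch of the same call with keywords:

```python
import numpy as np

# Keyword arguments remove any ambiguity about parameter order
samples = np.random.normal(loc=70, scale=10, size=100)
print(f"Mean: {samples.mean():.2f}, Std: {samples.std():.2f}")
```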
Handling shape issues with np.random.randint()
A frequent pitfall with np.random.randint() is a TypeError from incorrect syntax. This happens when you define a multi-dimensional array's shape using separate arguments instead of a single tuple for the size parameter. The code below demonstrates this common mistake.
import numpy as np
# Trying to create a 3x3 matrix but using wrong syntax
matrix = np.random.randint(0, 10, 3, 3)
print(matrix)
This code triggers a TypeError because np.random.randint() reads the extra 3 as an invalid data type, not a dimension. To create a matrix, you need to pass the shape differently. See the correct implementation below.
import numpy as np
# Correct way to specify shape for a 3x3 matrix
matrix = np.random.randint(0, 10, size=(3, 3))
print(matrix)
The correct approach is to pass the desired shape as a tuple to the size parameter. By using size=(3, 3), you're explicitly telling NumPy to create a 3x3 matrix. This avoids the TypeError that occurs when the function misinterprets the extra numbers as invalid arguments. It's a common mistake when you're building multi-dimensional arrays for tasks like image processing or setting up machine learning inputs.
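The same tuple syntax scales to any number of dimensions; for example, a batch of small grayscale "images" as a 3D array:

```python
import numpy as np

# A batch of 5 images, each 4x4, with pixel values in [0, 256)
batch = np.random.randint(0, 256, size=(5, 4, 4))
print(batch.shape)  # (5, 4, 4)
```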
Real-world applications
Beyond the code, these data generation methods solve tangible problems, from simulating financial markets to creating image datasets for machine learning.
Simulating financial market data with random walks
A random walk model is a powerful method for simulating the unpredictable movements of financial assets, treating each price change as a random step from the previous one.
import numpy as np
# Simulate stock price using random walk
initial_price = 100
days = 252 # Trading days in a year
daily_returns = np.random.normal(0.0005, 0.01, days)
price_series = initial_price * np.cumprod(1 + daily_returns)
print(f"Price journey: ${initial_price:.2f} → ${price_series[-1]:.2f}")
print(f"Range: ${price_series.min():.2f} - ${price_series.max():.2f}")
This code models a stock's price path over a year. It begins with an initial_price and then simulates daily price changes based on random fluctuations.
- The np.random.normal() function generates an array of daily returns. These returns are centered around a slight positive average to simulate growth, with a standard deviation to represent market volatility.
- Next, np.cumprod() calculates the cumulative product of these returns. This step chains the daily changes together, building a new price path where each day's value depends on the last.
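The same model extends naturally to Monte Carlo simulation: draw a 2D array of returns, one row per path, and take the cumulative product along each row. A sketch using the same parameters as above:

```python
import numpy as np

initial_price = 100
days = 252
n_paths = 1000

# One row of daily returns per simulated path
daily_returns = np.random.normal(0.0005, 0.01, size=(n_paths, days))
price_paths = initial_price * np.cumprod(1 + daily_returns, axis=1)

# Summarize the distribution of outcomes after one year
final_prices = price_paths[:, -1]
print(f"Mean final price: ${final_prices.mean():.2f}")
print(f"5th-95th percentile: ${np.percentile(final_prices, 5):.2f} - "
      f"${np.percentile(final_prices, 95):.2f}")
```

Simulating many paths turns a single "what if" into a distribution of outcomes, which is how analysts estimate ranges rather than point forecasts.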
Creating synthetic image data for ML training
NumPy arrays can represent simple images, allowing you to generate entire datasets of pixel data for training and testing computer vision models.
import numpy as np
# Generate a dataset of noisy images for machine learning
n_samples = 5
n_features = 16 # 4x4 images
X = np.random.rand(n_samples, n_features) # Features (flattened images)
y = np.random.randint(0, 2, n_samples) # Binary labels
# Reshape first image to 2D for visualization
first_image = X[0].reshape(4, 4)
print(f"Dataset: {n_samples} images, each {int(np.sqrt(n_features))}x{int(np.sqrt(n_features))}")
print(f"First image (pixel values):\n{first_image.round(2)}")
print(f"Labels: {y}")
This code builds a basic dataset for a machine learning classification task. It generates two key components: a feature matrix X and a target vector y.
- The X matrix contains five samples, each with 16 features representing a flattened 4x4 image. The np.random.rand() function fills it with random pixel values.
- The y vector holds a corresponding binary label (0 or 1) for each image, created with np.random.randint().
Finally, the code reshapes the first sample into a 4x4 grid to help you visualize the image structure.
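Purely random labels give a model nothing to learn. To make the toy dataset separable, you can tie pixel brightness to the label; the construction below is a hypothetical example, not a standard dataset:

```python
import numpy as np

n_samples = 100
n_features = 16  # flattened 4x4 images

# Class 1 images are brighter on average than class 0 images:
# class 0 pixels fall in [0, 0.5), class 1 pixels in [0.5, 1.0)
y = np.random.randint(0, 2, n_samples)
X = np.random.rand(n_samples, n_features) * 0.5 + y[:, None] * 0.5

print(f"Class 0 mean brightness: {X[y == 0].mean():.2f}")
print(f"Class 1 mean brightness: {X[y == 1].mean():.2f}")
```

A classifier trained on this data has a real signal to pick up, which makes it far more useful for testing a training pipeline end to end.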
Get started with Replit
Turn these techniques into a real tool. Tell Replit Agent to “build a stock price simulator using a random walk model” or “create a dashboard that generates time series data with adjustable trends.”
It writes the code, tests for errors, and deploys your app automatically. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.