How to split data into training and testing sets in Python

Learn how to split data into training and testing sets in Python. Explore methods, tips, real-world applications, and common error fixes.

Published on: Fri, Feb 20, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

To build a reliable machine learning model, you must split your data into training and testing sets. This step ensures you can accurately evaluate your model's performance on unseen data.

In this article, we'll cover several techniques to split your data using popular Python libraries. We'll also provide practical tips, explore real-world applications, and offer debugging advice for a smooth implementation.

Basic split using train_test_split

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(50, 2) # Sample features
y = np.arange(50) # Sample labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}, Testing set shape: {X_test.shape}")

Output:
Training set shape: (40, 2), Testing set shape: (10, 2)

The train_test_split function from scikit-learn is a straightforward way to partition your data. It shuffles the dataset randomly before splitting it, which helps prevent any bias from the original data order. Two key parameters control this process:

  • test_size=0.2: This argument specifies that 20% of your data will be set aside for the test set. The remaining 80% is used for training.
  • random_state=42: Setting this ensures your split is reproducible. Every time you run the code, you'll get the exact same training and testing sets, which is crucial for consistent model evaluation.
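The function isn't limited to numpy arrays. As a minimal sketch (the small DataFrame here is a made-up stand-in for a real dataset), train_test_split also accepts pandas DataFrames and Series and returns the same types with the original row indices preserved:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Hypothetical DataFrame standing in for a real dataset
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# DataFrames in, DataFrames out, with original indices preserved
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 8 and 2 rows
```

Keeping the original indices is handy when you need to trace a test-set prediction back to a specific row in your source data.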

Core splitting techniques

While train_test_split is a versatile tool, certain situations demand more specialized techniques for manual control, balancing classes, or handling sequential data.

Manual splitting with numpy

import numpy as np

data = np.arange(100).reshape(50, 2)
labels = np.arange(50)

train_ratio = 0.8
train_size = int(len(data) * train_ratio)

train_data, test_data = data[:train_size], data[train_size:]
train_labels, test_labels = labels[:train_size], labels[train_size:]

print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")

Output:
Train size: 40, Test size: 10

Manually splitting with numpy gives you direct control over how your data is divided. You simply calculate a split index based on a train_ratio and use Python's slicing syntax—like data[:train_size] and data[train_size:]—to create the training and testing sets in a memory-efficient way.

  • A crucial point to remember is that this method doesn't shuffle the data. It performs a sequential split, which might be ideal for time-series data but could introduce bias if your dataset has an inherent order.
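If you want the manual approach but with shuffling, one common sketch is to permute a shared index array first, so features and labels stay aligned:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

data = np.arange(100).reshape(50, 2)
labels = np.arange(50)

# Shuffle one index array and apply it to both data and labels
perm = rng.permutation(len(data))
train_size = int(len(data) * 0.8)
train_idx, test_idx = perm[:train_size], perm[train_size:]

train_data, test_data = data[train_idx], data[test_idx]
train_labels, test_labels = labels[train_idx], labels[test_idx]

print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")
```

Shuffling the indices rather than the arrays themselves guarantees that row i of the features always travels with label i, which is the easiest way to avoid silently misaligning the two.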

Stratified splitting for balanced classes

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 2)
y = np.array([0] * 80 + [1] * 20) # Imbalanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Original class ratio: {np.bincount(y) / len(y)}")
print(f"Training class ratio: {np.bincount(y_train) / len(y_train)}")

Output:
Original class ratio: [0.8 0.2]
Training class ratio: [0.8 0.2]

When your dataset has imbalanced classes—for example, 80% of one category and 20% of another—a standard random split can be misleading. You risk creating a test set that doesn't fairly represent the minority class, which skews your model's performance evaluation.

  • Stratified splitting solves this problem. By adding the parameter stratify=y, you tell the function to preserve the original percentage of each class in both the training and testing sets. This ensures your model is evaluated on a truly representative data sample.
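One caveat worth knowing: stratification needs at least two samples of every class, or scikit-learn raises a ValueError. A minimal sketch of the failure mode:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(5, 2)
y = np.array([0, 0, 0, 0, 1])  # class 1 has only a single sample

try:
    train_test_split(X, y, test_size=0.2, stratify=y)
    error_message = None
except ValueError as e:
    # sklearn refuses: it can't place class 1 in both train and test
    error_message = str(e)
    print(f"Stratification failed: {error_message}")
```

If you hit this with a real dataset, the usual fixes are collecting more minority-class samples or merging ultra-rare categories before splitting.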

Time-based splitting for sequential data

import pandas as pd
import numpy as np

dates = pd.date_range(start='2023-01-01', periods=100, freq='D')
data = np.random.randn(100, 2)
time_series = pd.DataFrame(data, index=dates, columns=['A', 'B'])

cutoff_date = '2023-03-15'
train_df = time_series.loc[time_series.index <= cutoff_date]
test_df = time_series.loc[time_series.index > cutoff_date]

print(f"Training: {train_df.shape}, Testing: {test_df.shape}")

Output:
Training: (74, 2), Testing: (26, 2)

For sequential data, like stock prices or sensor readings, random shuffling isn't an option because it destroys the chronological order. You need to train your model on past data to predict future events. This approach mimics how you'd use the model in the real world.

  • The code defines a cutoff_date to split the dataset. All data up to and including this date is used for training, while everything after it is reserved for testing.
  • The boolean comparisons against the DatetimeIndex keep the two sets disjoint. A pair of label slices like .loc[:cutoff_date] and .loc[cutoff_date:] looks equivalent, but pandas label slicing is inclusive at both ends, so the cutoff row would land in both the training and testing sets.

Advanced splitting methods

When a simple train-test split doesn't cut it, advanced methods provide more robust ways to validate your model and manage complex data dependencies. These techniques are perfect for vibe coding machine learning experiments.

Cross-validation with KFold

from sklearn.model_selection import KFold
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold+1}: {len(train_idx)} train samples, {len(test_idx)} test samples")
    if fold == 0:  # Show only first fold details
        break

Output:
Fold 1: 80 train samples, 20 test samples

Cross-validation gives you a more reliable estimate of your model's performance than a single train and test split. The KFold function automates this by splitting the data into a specified number of parts, or "folds," then training and testing the model multiple times. This process ensures your results aren't just a fluke from one specific data partition, making it an essential technique for AI coding workflows.

  • The KFold object is set up with n_splits=5, creating five distinct folds from the data.
  • The loop iterates through each fold, using one for testing and the remaining four for training, giving you a more comprehensive performance metric.
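In practice you rarely write the fold loop yourself. A common sketch uses cross_val_score to run the full train-and-evaluate cycle across every fold (the LogisticRegression here is just a placeholder model for the random data):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold; averaging them smooths out split-to-split noise
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(f"Folds: {len(scores)}, mean accuracy: {scores.mean():.2f}")
```

Reporting the mean and spread of the per-fold scores gives a far more honest picture of model performance than any single split.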

Multi-stage splitting for train/validation/test

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)

# First split out test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split remaining data into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")

Output:
Train: 600, Validation: 200, Test: 200

A simple train/test split isn't enough when you need to tune your model's hyperparameters. Using the test set for tuning can cause your model to indirectly learn from it, leading to an overly optimistic performance estimate. This is where a three-way split—train, validation, and test—comes in. The validation set is for tuning, while the test set is reserved for the final, unbiased evaluation.

  • This is achieved with two sequential calls to train_test_split. First, you split off a test set from the entire dataset.
  • Next, you split the remaining data again to create your training and validation sets. Using test_size=0.25 on the remaining 80% of data results in a final 60% train, 20% validation, and 20% test split.
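The two-step pattern is easy to wrap in a small helper. The train_val_test_split function below is a hypothetical convenience wrapper (not part of scikit-learn) that rescales the validation fraction relative to the data left after the test split:

```python
from sklearn.model_selection import train_test_split
import numpy as np

def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper: two chained train_test_split calls."""
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # val_size is a fraction of the whole dataset, so rescale it
    # relative to what remains after removing the test set
    rel_val = val_size / (1 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=rel_val, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The rescaling step is the part people most often get wrong: asking for test_size=0.2 in the second call would yield a 64/16/20 split, not 60/20/20.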

Using GroupKFold for dependent samples

from sklearn.model_selection import GroupKFold
import numpy as np

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, 20)
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7]

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    print(f"Fold {fold+1} - Train groups: {np.unique([groups[i] for i in train_idx])}")
    print(f"Fold {fold+1} - Test groups: {np.unique([groups[i] for i in test_idx])}")
    if fold == 0:  # Show only first fold
        break

Output:
Fold 1 - Train groups: [1 2 4 6 7]
Fold 1 - Test groups: [3 5]

Sometimes your data isn't independent. You might have multiple data points from the same user or sensor, for example. GroupKFold is designed for these situations, preventing your model from being tested on data it has indirectly seen during training.

  • It works by ensuring all samples from a specific group—defined by the groups array—are kept together.
  • This means an entire group will land in either the training set or the test set, but never be split across both, giving you a more accurate performance evaluation.
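If you only need a single grouped train/test split rather than full cross-validation, GroupShuffleSplit applies the same group-aware logic. A minimal sketch:

```python
from sklearn.model_selection import GroupShuffleSplit
import numpy as np

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, 20)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7])

# One split, holding out roughly 30% of the groups for testing
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups))

# Verify no group appears on both sides of the split
overlap = set(groups[train_idx]) & set(groups[test_idx])
print(f"Overlapping groups: {overlap}")
```

Note that test_size here controls the fraction of *groups* held out, so the number of individual samples in the test set depends on how large those groups are.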

Move faster with Replit

Learning individual techniques is one thing, but building a complete application is another. Replit is an AI-powered development platform designed to bridge that gap. It comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. With Agent 4, you can take an idea to a working product—it handles the code, databases, APIs, and deployment, all from a simple description.

Instead of piecing together techniques, you can build tools that apply them in a real-world context:

  • A performance dashboard that uses stratified splitting to fairly evaluate a fraud detection model on an imbalanced transaction dataset.
  • A backtesting tool for a stock trading algorithm that uses time-based splitting to train on historical price data and test on recent data.
  • A churn prediction model for a subscription service that uses GroupKFold to ensure all data from a single user stays in either the training or test set, preventing data leakage.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with the right tools, splitting data can lead to subtle errors that compromise your model's integrity and reproducibility.

Fixing data leakage when preprocessing time series data

A frequent mistake with time-series data is applying preprocessing steps, like scaling, to the entire dataset before splitting. This contaminates your training data with information from the future, leading to an overly optimistic evaluation of your model's performance.

The following code demonstrates this error. Notice how the StandardScaler is fitted on the full dataset before it's divided into training and testing sets.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

dates = pd.date_range('2023-01-01', periods=100)
df = pd.DataFrame({'value': np.cumsum(np.random.randn(100))}, index=dates)

# Incorrect: scaling before splitting
scaler = StandardScaler()
df['scaled_value'] = scaler.fit_transform(df[['value']])
train_df, test_df = df.iloc[:80], df.iloc[80:]

By using scaler.fit_transform on the entire dataframe, the scaler learns from data that should be reserved for testing. This gives the model an unfair preview of the future. The corrected implementation is shown below.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

dates = pd.date_range('2023-01-01', periods=100)
df = pd.DataFrame({'value': np.cumsum(np.random.randn(100))}, index=dates)

# Correct: split first, then scale
train_df = df.iloc[:80].copy()  # .copy() avoids pandas warnings when adding columns to a slice
test_df = df.iloc[80:].copy()
scaler = StandardScaler()
train_df['scaled_value'] = scaler.fit_transform(train_df[['value']])
test_df['scaled_value'] = scaler.transform(test_df[['value']])

The solution is to split your data *before* any preprocessing. This prevents the model from learning from future data it shouldn't see yet.

  • First, fit your scaler on the training data using fit_transform(). This teaches the scaler the statistical properties of only the training set.
  • Then, apply that same fitted scaler to transform the test data using just transform(). This correctly applies the learned scaling without leaking information.
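scikit-learn's Pipeline makes this pattern hard to get wrong: bundle the scaler and the model together, and fitting the pipeline fits the scaler on the training data alone. A sketch with a placeholder classifier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 3)
y = np.random.randint(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on X_train only; predict() reuses that
# same fitted scaler on X_test, so no test data leaks into training
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(f"Predictions for {len(preds)} test samples")
```

Pipelines pay off even more with cross-validation, where the scaler must be refitted inside every fold; passing the pipeline to cross_val_score handles that automatically.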

Troubleshooting incorrect usage of stratify with continuous variables

The stratify parameter is designed for categorical labels, ensuring class proportions are maintained. It doesn't work with continuous data, like prices or measurements, because there are too many unique values to balance. Attempting this will trigger an error, as shown below.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 3)
y = np.random.randn(100) # Continuous target

# Error: can't stratify with continuous values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

The error happens because stratify=y is passed a continuous variable. The function can't create proportional splits from unique floating-point values and needs discrete classes instead. The following code demonstrates the correct implementation.

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

X = np.random.randn(100, 3)
y = np.random.randn(100) # Continuous target

# Create categorical bins for stratification
y_binned = pd.qcut(y, q=5, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y_binned, test_size=0.3, random_state=42
)

The solution is to convert the continuous variable into discrete categories before splitting. This is useful when your target variable is skewed and you want to maintain its distribution in both the train and test sets.

  • The code uses pandas' pd.qcut function to group the continuous y values into five bins.
  • This creates a new categorical series, y_binned, which you can then pass to the stratify parameter to perform a balanced split.

Resolving random_state issues in reproducible splits

When the random_state parameter is omitted, functions like train_test_split produce a different result on every run. This inconsistency makes debugging and model comparison unreliable. The following code demonstrates this by running the same split twice and comparing the outputs.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, 100)

# Without fixed random_state, results differ each run
split1 = train_test_split(X, y, test_size=0.3)
split2 = train_test_split(X, y, test_size=0.3)
print(f"Same splits? {np.array_equal(split1[0], split2[0])}")

The np.array_equal check confirms the two splits are different. Since no random_state was set, the function produces a new, random partition on each run. The following code shows how to ensure consistency.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, 100)

# Fix random_state for reproducibility
split1 = train_test_split(X, y, test_size=0.3, random_state=42)
split2 = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Same splits? {np.array_equal(split1[0], split2[0])}")

The `np.array_equal` check now confirms the splits are identical. By setting `random_state=42` in both calls to `train_test_split`, you're seeding the random number generator with a fixed value. This makes the shuffling and splitting process deterministic.

  • Always use a fixed `random_state` during development and experimentation. It ensures your results are reproducible, which is essential for debugging and making fair comparisons between different model versions or hyperparameters.

Real-world applications

With the common pitfalls addressed, these splitting techniques can be applied to solve critical problems in medicine and finance.

Handling imbalanced data in medical diagnosis with train_test_split

In medical diagnostics, where positive cases can be rare, using train_test_split with the stratify parameter is crucial to ensure your test set accurately represents the real-world class imbalance.

from sklearn.model_selection import train_test_split
import numpy as np

# Simulate imbalanced medical dataset (5% positive cases)
np.random.seed(42)
X = np.random.randn(1000, 5) # 5 features
y = np.zeros(1000, dtype=int)
y[:50] = 1 # Only 5% positive cases

# Regular split vs stratified split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.3)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, stratify=y)

print(f"Regular - Train: {sum(y_train1)/len(y_train1):.1%}, Test: {sum(y_test1)/len(y_test1):.1%}")
print(f"Stratified - Train: {sum(y_train2)/len(y_train2):.1%}, Test: {sum(y_test2)/len(y_test2):.1%}")

This code simulates a medical dataset where only 5% of cases are positive. It then performs two types of splits to demonstrate a key difference.

  • The first split uses a standard train_test_split call, which randomly divides the data without considering class balance.
  • The second split adds the stratify=y parameter. This tells the function to preserve the original 5% class distribution in both the training and testing sets.

This comparison highlights why stratification is essential for imbalanced datasets, ensuring your model's evaluation is based on a fair representation of all classes.

Using TimeSeriesSplit for financial forecasting validation

For financial data, TimeSeriesSplit implements a walk-forward validation, where the model is sequentially trained on expanding windows of past data to predict future outcomes.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Simulate daily stock prices for 100 days
np.random.seed(42)
stock_prices = 100 + np.cumsum(np.random.normal(0.1, 1, 100))

# Create a walk-forward validation with 3 folds
tscv = TimeSeriesSplit(n_splits=3)

# Show the training and testing periods
for fold, (train_idx, test_idx) in enumerate(tscv.split(stock_prices)):
    train_size = len(train_idx)
    test_size = len(test_idx)
    print(f"Fold {fold+1}: Train on days 1-{train_size}, predict days {train_size+1}-{train_size+test_size}")

This code shows how TimeSeriesSplit creates sequential folds for validation. After simulating stock price data, it initializes the splitter to create three distinct training and testing periods.

  • The key is that these splits aren't random. Each fold uses all data up to a certain point for training and the immediately following segment for testing.
  • This process ensures you're always predicting the "future" relative to your training data, which prevents data leakage from later time periods and provides a more realistic backtest.
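TimeSeriesSplit also accepts test_size and gap parameters in modern scikit-learn versions. The gap excludes a buffer of samples between each training window and its test window, which is useful when a prediction shouldn't rely on the most recent observations. A sketch:

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

prices = np.arange(100)  # simple stand-in for a price series

# Drop the 5 samples immediately before each test window from training
tscv = TimeSeriesSplit(n_splits=3, test_size=10, gap=5)

for train_idx, test_idx in tscv.split(prices):
    print(f"Train ends at index {train_idx[-1]}, test starts at index {test_idx[0]}")
```

A gap like this mimics real deployment constraints, for instance a model that is retrained on data available up to a few days before it starts making predictions.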

Get started with Replit

Turn these techniques into a working tool. Describe what you want to build, like "a dashboard that backtests a trading strategy using TimeSeriesSplit" or "an app that prepares imbalanced medical data with stratified splitting."

Replit Agent will write the code, test for errors, and deploy your application from a simple description. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
