How to shuffle a dataframe in Python

Learn how to shuffle a pandas DataFrame with our guide. Discover different methods, tips, real-world applications, and how to debug common errors.

Published on: Wed, Mar 25, 2026
Updated on: Thu, Mar 26, 2026
The Replit Team

Shuffling a DataFrame in Python is a crucial technique for data randomization. Randomly reordering rows helps ensure unbiased model training and robust analysis.

In this article, you'll learn techniques to shuffle data with functions like sample(). You will also find practical tips, real-world applications, and advice to debug common issues.

Using sample() to shuffle a DataFrame

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1)
print(shuffled_df)

Output:

   A  B
3  4  d
0  1  a
4  5  e
2  3  c
1  2  b

The sample() method offers a direct way to shuffle your DataFrame. The key is the frac=1 parameter, which tells pandas to return a random sample containing 100% of the rows from the original DataFrame.

  • This effectively reorders all the rows without dropping any data.
  • The method is non-destructive. It returns a new shuffled DataFrame, leaving your original one untouched, which is ideal for preserving your initial data state.
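To confirm the non-destructive behavior for yourself, here is a quick sketch using the same toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
original = df.copy()

# sample() returns a new object and leaves df untouched
shuffled_df = df.sample(frac=1)

print(df.equals(original))  # True: the original is unchanged
print(shuffled_df is df)    # False: a new DataFrame was returned
```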

Basic shuffling techniques

The sample() method offers more than just basic shuffling; you can also reset the index, ensure reproducibility, and even shuffle specific columns.

Using sample() and resetting the index

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1).reset_index(drop=True)
print(shuffled_df)

Output:

   A  B
0  5  e
1  2  b
2  1  a
3  3  c
4  4  d

After shuffling, the original index remains attached to the rows, creating a jumbled sequence. You can tidy this up by chaining the reset_index() method. This action generates a new, clean index starting from 0, making the DataFrame easier to work with.

  • The drop=True parameter is essential. It prevents pandas from adding the old index as a new column in your DataFrame.
  • This approach ensures you get a randomly ordered DataFrame with a neat, sequential index.
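To see why drop=True matters, compare the two calls side by side; without it, pandas keeps the old index around as a new column (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})

# Without drop=True, the old index survives as an 'index' column
kept = df.sample(frac=1).reset_index()
print(kept.columns.tolist())     # ['index', 'A', 'B']

# With drop=True, the old index is discarded entirely
dropped = df.sample(frac=1).reset_index(drop=True)
print(dropped.columns.tolist())  # ['A', 'B']
```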

Using a random state for reproducible shuffling

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1, random_state=42)
print(shuffled_df)

Output:

   A  B
3  4  d
1  2  b
0  1  a
4  5  e
2  3  c

When you need your shuffling to be predictable, the random_state parameter is your best friend. By setting it to a specific integer, like random_state=42, you ensure that the sample() method produces the exact same "random" order every single time you run the code.

  • This is incredibly useful for debugging or when you need to share your work and have others reproduce your results.
  • The number itself doesn't matter—you could use 1, 99, or any other integer—as long as you use the same one consistently.

Shuffling specific columns only

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
df['B'] = np.random.permutation(df['B'].values)
print(df)

Output:

   A  B
0  1  c
1  2  e
2  3  a
3  4  b
4  5  d

If you only need to randomize a single column, you can combine pandas with the NumPy library. This approach gives you more granular control over the shuffling process.

  • The key is the np.random.permutation() function, which takes a column's values—extracted using .values—and shuffles them.
  • You then assign this newly shuffled array back to the original column, overwriting its previous order.
  • This method is perfect when you need to preserve the order of other columns while randomizing just one.
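If you also need this column shuffle to be repeatable, one common approach is to seed NumPy's Generator API first, as in this sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})

# A seeded Generator makes the permutation reproducible across runs
rng = np.random.default_rng(42)
df['B'] = rng.permutation(df['B'].values)
print(df)
```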

Advanced shuffling approaches

Building on the np.random.permutation() function, you can unlock more sophisticated shuffling techniques for entire DataFrames, grouped data, and even custom data blocks.

Using NumPy's random.permutation() for shuffling

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_indices = np.random.permutation(len(df))
shuffled_df = df.iloc[shuffled_indices].reset_index(drop=True)
print(shuffled_df)

Output:

   A  B
0  3  c
1  5  e
2  1  a
3  4  d
4  2  b

For more granular control, you can use NumPy to shuffle your DataFrame's indices directly. The np.random.permutation() function creates a shuffled array of row numbers based on the DataFrame's length, which gives you a new, random order to apply.

  • You then use this array of shuffled indices with df.iloc[] to reorder the rows into the new sequence.
  • Finally, chaining .reset_index(drop=True) cleans up the index. This method offers a more manual yet powerful alternative to using sample().

Stratified shuffling by group

import pandas as pd

df = pd.DataFrame({
   'category': ['A', 'A', 'B', 'B', 'B', 'C'],
   'value': range(1, 7)
})
shuffled_df = df.groupby('category').sample(frac=1).sample(frac=1)
print(shuffled_df)

Output:

  category  value
1        A      2
5        C      6
2        B      3
0        A      1
3        B      4
4        B      5

Stratified shuffling is perfect when you need to randomize data while respecting its group structure, like ensuring each category is represented fairly. This technique involves two key steps.

  • First, df.groupby('category').sample(frac=1) shuffles the rows within each distinct category. This randomizes the order of items inside group 'A', group 'B', and so on, but keeps the groups themselves separate.
  • Afterward, the groups are still clustered together. A second, chained .sample(frac=1) shuffles the entire DataFrame, mixing the already-shuffled groups into a completely random order.
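Passing random_state to both sample() calls makes the two-step stratified shuffle reproducible as well; here is a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'value': range(1, 7)
})

# Seed both the within-group shuffle and the final full shuffle
shuffled_df = (
    df.groupby('category')
      .sample(frac=1, random_state=42)
      .sample(frac=1, random_state=42)
)
print(shuffled_df)
```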

Creating custom block-wise shuffling

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1, 11), 'B': list('abcdefghij')})
blocks = [df.iloc[i:i+2] for i in range(0, len(df), 2)]
np.random.shuffle(blocks)
block_shuffled = pd.concat(blocks).reset_index(drop=True)
print(block_shuffled)

Output:

    A  B
0   7  g
1   8  h
2   1  a
3   2  b
4   9  i
5  10  j
6   5  e
7   6  f
8   3  c
9   4  d

Block-wise shuffling is a technique for randomizing chunks of data while keeping the rows within each chunk together. This approach first slices the DataFrame into a list of smaller DataFrames, or "blocks," using a list comprehension with df.iloc.

  • The np.random.shuffle() function then reorders these blocks randomly.
  • Finally, pd.concat() stitches the shuffled blocks back into a single DataFrame.
  • Chaining .reset_index(drop=True) cleans up the index for a tidy result. This method is useful when you need to maintain the internal order of specific data segments.

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the shuffling techniques we've explored, Replit Agent can turn them into production-ready tools:

  • Build an A/B testing tool that randomizes UI elements to gather unbiased user feedback.
  • Create a data sampling utility that performs stratified shuffling to ensure balanced class representation in training sets.
  • Deploy a study guide generator that shuffles topics while keeping related concepts grouped together in blocks.

Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all inside your browser.

Common errors and challenges

While shuffling DataFrames is straightforward, you might run into a few common challenges with indexing, reproducibility, and grouped data.

  • A frequent issue is the jumbled index that sample() leaves behind. Because the original index sticks to its rows, you end up with a non-sequential order that can complicate row selection. The fix is to chain .reset_index(drop=True), which generates a clean, new index. If you forget drop=True, pandas will add the old, messy index as a new column, which usually isn't what you want.
  • Another challenge is ensuring your shuffle is reproducible. By default, sample() produces a different result every time, which is a problem for debugging or sharing work. Using the random_state parameter with any integer value acts as a seed, guaranteeing the same "random" order on every run. This is essential for any analysis that needs to be validated or replicated by others.
  • Finally, applying sample() naively can break the integrity of grouped data. If your DataFrame contains categories that need to stay together, a simple shuffle will mix them all up. To shuffle rows only within their respective groups, you should first use groupby() before calling sample(). This preserves the stratified structure required for tasks like creating balanced machine learning datasets.

Dealing with index issues after using sample()

When you shuffle a DataFrame with sample(), the original index labels tag along with the rows. Accessing shuffled_df.loc[0] then returns the row that originally carried label 0, not the first row of the shuffled result, which is a silent logic bug. And if you sample only a fraction of the rows (frac < 1), a label can disappear entirely, raising a KeyError. The following code demonstrates the mismatch.

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1)

# With frac=1, the label 0 still exists, so this returns the
# original first row, not the new top row of the shuffled DataFrame
print(shuffled_df.loc[0])

The .loc accessor looks up the original index label, not the row's new position. Because frac=1 keeps every row, the label 0 still exists and no error is raised; you just silently get the wrong row. (If you sample with frac < 1 and the row labeled 0 is dropped, the same lookup raises a KeyError.) The corrected approach below shows how to handle this.

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1).reset_index(drop=True)

# Now we can safely access by position
print(shuffled_df.loc[0])

The fix is to chain .reset_index(drop=True) after shuffling. This method discards the old, jumbled index and creates a new, clean one starting from 0. Now shuffled_df.loc[0] reliably returns the first row of the shuffled DataFrame, with no risk of a KeyError or a silently wrong result. This is crucial whenever you need to access rows by their integer position after randomizing your data.
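Alternatively, if you want to keep the shuffled labels around, the .iloc accessor always selects by physical position regardless of what the index contains, as this brief sketch shows:

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
shuffled_df = df.sample(frac=1)

# .iloc[0] returns the first physical row, whatever its label is
print(shuffled_df.iloc[0])
```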

Ensuring reproducible shuffling with random_state

When you need to debug your code or share your analysis, getting a different random order every time is a major roadblock. The sample() function's default behavior creates this inconsistency, making it impossible to replicate results for validation or collaboration. The following code demonstrates how two separate shuffles on the same DataFrame will almost certainly produce different outcomes.

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
first_shuffle = df.sample(frac=1)
second_shuffle = df.sample(frac=1)

print("Same shuffle?", first_shuffle.equals(second_shuffle))

Because each sample() call generates a new, unpredictable sequence, the .equals() check will almost certainly return False. This makes results impossible to replicate. The corrected approach below ensures consistent shuffling every time.

import pandas as pd

df = pd.DataFrame({'A': range(1, 6), 'B': list('abcde')})
first_shuffle = df.sample(frac=1, random_state=42)
second_shuffle = df.sample(frac=1, random_state=42)

print("Same shuffle?", first_shuffle.equals(second_shuffle))

The fix is to add the random_state parameter to your sample() call. By setting it to any integer, like random_state=42, you provide a seed for the randomization. This guarantees that every time you run the code, you get the exact same shuffled DataFrame. It's essential for debugging your code or when you need others to replicate your analysis, as the .equals() check will now return True.

Handling group integrity when using sample()

When you're working with categorical data, a simple sample() can unintentionally skew your results. It might overrepresent some groups while completely dropping others, breaking the data's original structure. The code below shows how this can lead to an unbalanced sample.

import pandas as pd

df = pd.DataFrame({
   'group': ['A', 'A', 'B', 'B', 'C'],
   'value': [1, 2, 3, 4, 5]
})

shuffled = df.sample(frac=0.6)
print(shuffled['group'].value_counts())

Because sample() pulls rows from the entire DataFrame, it’s blind to the ‘group’ column. This means it might accidentally drop a whole category from your data, skewing the results. The following code shows how to correct this.

import pandas as pd

df = pd.DataFrame({
   'group': ['A', 'A', 'B', 'B', 'C'],
   'value': [1, 2, 3, 4, 5]
})

shuffled = df.groupby('group').sample(frac=0.6)
print(shuffled['group'].value_counts())

The fix is to chain groupby('group') before calling sample(). This tells pandas to perform the sampling operation within each group independently, rather than across the entire DataFrame. As a result, each category is sampled proportionally, preserving the original data structure. This technique, known as stratified sampling, is crucial for creating balanced training sets in machine learning, ensuring your model isn't biased toward any single group.

Real-world applications

With a firm grasp on shuffling mechanics and error handling, you can apply these skills to critical data science applications.

Using sample() for training-test splits

Shuffling your DataFrame is a critical first step when splitting it into training and testing sets for machine learning.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'feature': range(100), 'target': [i % 2 for i in range(100)]})
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

The train_test_split function from scikit-learn partitions your data into two sets. It automatically shuffles the rows before splitting to ensure the data is randomized, which helps prevent model bias.

  • The test_size=0.3 parameter reserves 30% of the data for the test set, leaving the rest for training.
  • Setting random_state=42 guarantees the same "random" split every time, which is essential for getting reproducible results.
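When your classes are imbalanced, train_test_split can also preserve their proportions in both splits via its stratify parameter, as in this sketch extending the example above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'feature': range(100), 'target': [i % 2 for i in range(100)]})

# stratify keeps the 50/50 target ratio in both splits
train_df, test_df = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df['target']
)
print(train_df['target'].value_counts().to_dict())  # {0: 35, 1: 35} (key order may vary)
print(test_df['target'].value_counts().to_dict())   # {0: 15, 1: 15} (key order may vary)
```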

Creating bootstrap samples with sample()

The sample() function, with its replace=True parameter, provides a straightforward way to perform bootstrapping—a resampling method for estimating statistical uncertainty.

import pandas as pd
import numpy as np

sales_df = pd.DataFrame({'sales': [120, 85, 190, 110, 155, 140, 180, 95, 130, 170]})
bootstrap_means = [sales_df.sample(frac=1, replace=True)['sales'].mean() for _ in range(1000)]
confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"Mean sales: {sales_df.sales.mean():.2f}")
print(f"95% Confidence Interval: [{confidence_interval[0]:.2f}, {confidence_interval[1]:.2f}]")

This technique gauges how stable the average sales figure is. It uses a list comprehension to run a simulation 1,000 times, creating a list of different possible mean sales values.

  • In each run, sample(frac=1, replace=True) builds a new DataFrame by randomly picking rows from the original. Because replace=True is set, some rows might appear multiple times while others are skipped.
  • The code then calculates the mean of this new sample and stores it.

After generating 1,000 sample means, np.percentile() finds the range containing the central 95% of them. This gives you a confidence interval for the true average.

Get started with Replit

Now, turn these shuffling techniques into a real tool. Tell Replit Agent: “Build a utility that performs stratified sampling on a CSV file” or “Create a flashcard app that shuffles questions and answers.”

Replit Agent will write the code, test for errors, and deploy your application right from your browser. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
