How to do one-hot encoding in Python
Learn how to do one-hot encoding in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

One-hot encoding is a vital data preparation technique for machine learning. It converts categorical data into a numerical format that models can work with, since most algorithms can't process text categories directly.
In this article, you'll explore several methods to implement one-hot encoding in Python. You'll find practical techniques, implementation tips, real-world applications, and clear advice for debugging common issues.
Basic one-hot encoding with pandas.get_dummies()
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded_data = pd.get_dummies(data, columns=['color'])
print(encoded_data)

Output:
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
The pandas.get_dummies() function is the quickest way to perform one-hot encoding. It automatically detects the unique categories within the specified column—in this case, the color column—and generates a new binary column for each one.
The output replaces the original categorical data with new columns like color_blue and color_red. A 1 signifies the presence of that category in the original row, while a 0 marks its absence. This numerical format is ideal for machine learning models, which can't process text-based categories directly.
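One version note: in pandas 2.0 and later, get_dummies() returns boolean True/False columns by default. If you want the 0/1 integers shown above, you can pass the dtype parameter; a minimal sketch assuming a recent pandas:

```python
import pandas as pd

data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# dtype=int requests 0/1 integer columns instead of the
# boolean True/False default used by pandas 2.0+
encoded_data = pd.get_dummies(data, columns=['color'], dtype=int)
print(encoded_data)
```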
Built-in libraries for one-hot encoding
While pandas provides a straightforward approach, dedicated machine learning libraries offer more robust and flexible options for handling categorical data.
One-hot encoding with scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
encoder = OneHotEncoder(sparse_output=False)
data = [['red'], ['green'], ['blue'], ['red']]
encoded_array = encoder.fit_transform(data)
print(encoded_array)

Output:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
Scikit-learn's OneHotEncoder is a powerful tool built for machine learning workflows. It separates the process into two stages: learning the unique categories with fit() and then converting the data with transform(). The fit_transform() method conveniently does both at once.
- This encoder works with 2D array-like data, such as the list of lists shown in the example.
- Setting sparse_output=False (named sparse in scikit-learn releases before 1.2) returns a standard NumPy array. Without it, you'd get a sparse matrix, which is more memory-efficient but harder to read.
Using keras.utils.to_categorical for label encoding
import numpy as np
from tensorflow.keras.utils import to_categorical
# First convert categories to indices
labels = np.array(['red', 'green', 'blue', 'red'])
label_map = {'red': 0, 'green': 1, 'blue': 2}
indices = np.array([label_map[x] for x in labels])
one_hot = to_categorical(indices)
print(one_hot)

Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
The keras.utils.to_categorical function is a go-to for deep learning tasks, particularly when preparing labels for classification models. Its main distinction is that it doesn't operate on raw text data—it requires integer inputs first.
- You must convert your string categories into numerical indices before using the function. The example does this manually with a dictionary named label_map.
- Once you have an array of integers, to_categorical transforms it into the final one-hot encoded format, ready for a neural network.
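If you'd rather not build the mapping dictionary by hand, np.unique() with return_inverse=True produces the integer indices automatically. Note that it sorts categories alphabetically, so the ordering differs from the manual label_map above; a sketch using only NumPy:

```python
import numpy as np

labels = np.array(['red', 'green', 'blue', 'red'])

# uniques holds the sorted category names; indices maps each
# label to its position in uniques
uniques, indices = np.unique(labels, return_inverse=True)
print(uniques)    # ['blue' 'green' 'red']
print(indices)    # [2 1 0 2]

# indices can now be passed to keras.utils.to_categorical
```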
One-hot encoding with tensorflow.one_hot()
import tensorflow as tf
# Convert categories to indices first
indices = [0, 1, 2, 0] # 0=red, 1=green, 2=blue
one_hot_tensor = tf.one_hot(indices, depth=3)
print(one_hot_tensor.numpy())

Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
TensorFlow's tf.one_hot() function is built for deep learning workflows. Similar to the Keras utility, it operates on integer indices rather than raw text, so you'll need to map your categories to numbers first.
- The depth parameter is essential: it tells the function the total number of unique categories, which defines the length of each one-hot vector.
- This function outputs a TensorFlow Tensor, the native data format for computations within the TensorFlow framework.
Advanced one-hot encoding techniques
When the standard library functions don't quite fit your needs, you can turn to these advanced techniques for more control, efficiency, and flexibility.
Custom one-hot encoding with NumPy
import numpy as np
def custom_one_hot(data, num_categories):
    result = np.zeros((len(data), num_categories))
    result[np.arange(len(data)), data] = 1
    return result
indices = np.array([0, 1, 2, 0]) # 0=red, 1=green, 2=blue
one_hot = custom_one_hot(indices, 3)
print(one_hot)

Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
Building your own one-hot encoding function with NumPy gives you maximum control over the process. Like the Keras and TensorFlow methods, this approach requires you to convert categories to integer indices first. The custom function then uses these indices to construct the final array.
- It begins by creating a matrix of zeros with np.zeros(), correctly sized for your data.
- The core of the function is NumPy's advanced indexing: the line result[np.arange(len(data)), data] = 1 places a 1 in the correct column of each row, building the whole one-hot array in a single vectorized step.
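The same trick fits in one line with np.eye(): row i of the identity matrix is exactly the one-hot vector for category i, so indexing the identity matrix with your index array assembles the full encoding. A minimal sketch:

```python
import numpy as np

indices = np.array([0, 1, 2, 0])  # 0=red, 1=green, 2=blue

# Row i of the identity matrix is the one-hot vector for category i,
# so fancy indexing with the index array builds the whole encoding
one_hot = np.eye(3)[indices]
print(one_hot)
```

This is concise for small arrays, but it materializes a full identity matrix, so prefer the explicit np.zeros() approach when the number of categories is large.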
Memory-efficient one-hot encoding for large datasets
from scipy import sparse
import numpy as np
def sparse_one_hot(indices, num_categories):
    n_samples = len(indices)
    row_indices = np.arange(n_samples)
    one_hot = sparse.csr_matrix(
        (np.ones(n_samples), (row_indices, indices)),
        shape=(n_samples, num_categories)
    )
    return one_hot
indices = np.array([0, 1, 2, 0])
sparse_matrix = sparse_one_hot(indices, 3)
print(sparse_matrix.toarray())

Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
When you're working with large datasets, standard one-hot encoding can consume a lot of memory by storing countless zeros. This custom function uses SciPy to create a sparse matrix, a far more memory-efficient solution. It works by only recording the positions of the non-zero values—the 1s in this case—and ignoring everything else.
- The core of this method is sparse.csr_matrix(), which builds a Compressed Sparse Row matrix.
- This format is perfect for one-hot encoding since the resulting data is mostly zeros.
- To view the output as a standard array, call the .toarray() method on the sparse matrix.
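To make the savings concrete, the sketch below compares the memory used by a dense encoding against the CSR version for 1,000 rows and 1,000 categories. The exact byte counts depend on your NumPy and SciPy defaults, but the sparse form stores only one value per row:

```python
import numpy as np
from scipy import sparse

n_samples, num_categories = 1_000, 1_000
indices = np.random.randint(0, num_categories, size=n_samples)

# Dense encoding: one float per cell, almost all of them zero
dense = np.zeros((n_samples, num_categories))
dense[np.arange(n_samples), indices] = 1

# Sparse encoding: only the positions of the 1s are stored
sp = sparse.csr_matrix(
    (np.ones(n_samples), (np.arange(n_samples), indices)),
    shape=(n_samples, num_categories)
)

dense_bytes = dense.nbytes
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")
```

The gap widens as the number of categories grows, which is exactly the situation (high-cardinality features) where sparse matrices pay off most.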
One-hot encoding with categorical features preservation
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'color': ['red', 'green', 'blue'], 'size': ['S', 'M', 'L']})
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['color', 'size']])
feature_names = encoder.get_feature_names_out(['color', 'size'])
encoded_df = pd.DataFrame(encoded, columns=feature_names)
print(encoded_df)

Output:
   color_blue  color_green  color_red  size_L  size_M  size_S
0         0.0          0.0        1.0     0.0     0.0     1.0
1         0.0          1.0        0.0     0.0     1.0     0.0
2         1.0          0.0        0.0     1.0     0.0     0.0
When you encode multiple categorical columns, the output can become a confusing sea of ones and zeros. This technique preserves the original feature context, making your data much easier to interpret.
- The OneHotEncoder processes multiple columns, like color and size, simultaneously.
- After fitting the data, you can call get_feature_names_out(), which generates clear, descriptive column names for the new DataFrame.
The result is a clean, labeled dataset that is ready for your model, with each column clearly tied to its original category.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the one-hot encoding techniques you've just learned, Replit Agent can turn them into production-ready tools. You can build applications that handle categorical data from start to finish.
- Build a data preprocessing utility that uses pandas.get_dummies() to automatically convert categorical columns in a CSV file.
- Create a sentiment analysis dashboard that uses one-hot encoding to prepare text labels for a classification model.
- Deploy a machine learning API that accepts raw categorical data and uses a scikit-learn pipeline to encode it before making a prediction.
Describe your app idea to Replit Agent, and it will write the code, test it, and handle deployment, all from your browser.
Common errors and challenges
One-hot encoding is powerful, but you'll need to navigate a few common challenges to use it effectively.
Handling missing values with pd.get_dummies()
By default, when pd.get_dummies() encounters missing values (often represented as NaN), it creates a row of zeros across all the new encoded columns. This effectively ignores the missing data, which might not be what you want.
If you want your model to know that data was missing, you can set the dummy_na=True parameter. This tells the function to create an additional column specifically to flag rows where the original value was missing, turning that absence of data into a useful feature.
Dealing with new categories in OneHotEncoder
A common issue arises when your model encounters new categories in live data that weren't in the training set. Scikit-learn's OneHotEncoder will throw an error by default because it doesn't know how to handle the unfamiliar category.
You can prevent this by initializing the encoder with handle_unknown='ignore'. With this setting, if a new category appears, the encoder will simply output a row of all zeros for the encoded columns instead of crashing your program. This makes your data pipeline more resilient.
Avoiding multicollinearity with drop_first=True
One-hot encoding introduces multicollinearity, a situation where the new columns are perfectly predictable from one another. For example, if you have columns for color_red and color_blue, a 0 in both implies the original color must have been the third option, say, green. This redundancy can be a problem for some models, like linear regression.
To solve this, use the drop_first=True parameter in pd.get_dummies(); scikit-learn's OneHotEncoder offers the equivalent drop='first' option. This removes the first category's column, breaking the perfect dependency: the dropped category becomes the implicit baseline that the remaining columns are measured against. It's important to note that while this is crucial for linear models, tree-based models like random forests are generally unaffected by multicollinearity.
Handling missing values with pd.get_dummies()
Missing values, like None or NaN, present a common challenge. The pd.get_dummies() function’s default response is to output all zeros for that row, effectively treating it as if it belongs to no category. The code below shows this in action.
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', None, 'blue']})
encoded_data = pd.get_dummies(data, columns=['color'])
print(encoded_data)
Because the None value becomes a row of all zeros, your model loses the important information that the data was missing. The next example shows how you can preserve this information as a distinct feature.
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', None, 'blue']})
encoded_data = pd.get_dummies(data, columns=['color'], dummy_na=True)
print(encoded_data)
The solution is to set the dummy_na=True parameter. This instructs pd.get_dummies() to create an extra column that explicitly flags missing data. In the example, a 1 appears in the new color_nan column where the original value was None. This approach is valuable when the absence of data is itself a predictive feature you want your model to learn from.
Dealing with new categories in OneHotEncoder
A OneHotEncoder learns categories from your training data. When it encounters a new category in your test data, it doesn't know what to do and will raise an error by default. The following code demonstrates this exact scenario.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
train_data = pd.DataFrame({'color': ['red', 'green', 'blue']})
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(train_data)
test_data = pd.DataFrame({'color': ['yellow']})
encoded_test = encoder.transform(test_data)
The encoder learns the categories 'red', 'green', and 'blue' from the training data. When it tries to transform 'yellow'—a category it has never seen before—it fails. The code below shows how to adjust the encoder to handle this.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
train_data = pd.DataFrame({'color': ['red', 'green', 'blue']})
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_data)
test_data = pd.DataFrame({'color': ['yellow']})
encoded_test = encoder.transform(test_data)
The solution is to initialize the encoder with handle_unknown='ignore'. This setting makes your data pipeline more resilient by telling the encoder to output a row of all zeros when it encounters an unknown category, instead of raising an error. It’s a crucial adjustment for production environments where your model will inevitably see data that wasn't in the original training set, preventing your application from crashing and ensuring it can handle unexpected inputs gracefully.
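In a real pipeline you'd bundle the encoder with a model so the same safeguard applies at prediction time. The sketch below is a hypothetical example (toy labels, scikit-learn 1.2+ assumed) showing that a pipeline with handle_unknown='ignore' still returns a prediction for the unseen 'yellow':

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

train_X = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
train_y = [1, 0, 0, 1]  # toy labels purely for illustration

# handle_unknown='ignore' keeps the pipeline from crashing on unseen categories
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression()
)
pipeline.fit(train_X, train_y)

# 'yellow' encodes to an all-zero row, but prediction still succeeds
print(pipeline.predict(pd.DataFrame({'color': ['yellow']})))
```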
Avoiding multicollinearity with drop_first=True
One-hot encoding creates new columns that are perfectly predictable from one another, a problem called multicollinearity. For example, if a row isn't 'red' or 'green', it must be 'blue'. This redundancy can confuse some models, like linear regression.
The default behavior of pd.get_dummies() introduces this issue by creating a column for every category. The following code demonstrates how this leads to a dataset where one column's value can be inferred from the others.
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded_data = pd.get_dummies(data, columns=['color'])
print(encoded_data)
Here, pd.get_dummies() creates columns for all three colors. This introduces redundancy because if a row is not red and not green, it must be blue. The next example shows how to correct this behavior.
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded_data = pd.get_dummies(data, columns=['color'], drop_first=True)
print(encoded_data)
By setting drop_first=True, you instruct pd.get_dummies() to remove the first alphabetical category column, in this case color_blue. This simple change resolves the multicollinearity issue by breaking the perfect dependency between the new columns. The model can still infer the dropped category—if all other encoded columns are 0, the original value must have been the one that was removed. This is crucial for linear models but less of a concern for tree-based algorithms.
Real-world applications
Now that you can navigate the common pitfalls, you can apply one-hot encoding to powerful real-world systems for text classification and recommendations.
Using one-hot encoding for text classification
You'll often use one-hot encoding for text classification to convert categorical labels, such as 'support' or 'complaint', into a binary format that a model can understand.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample categories of text messages
categories = [['support'], ['complaint'], ['feedback'], ['complaint']]
# One-hot encode the categories for classification
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(categories)
print(encoded)
This code uses scikit-learn's OneHotEncoder to convert a list of text-based categories into a numerical format. The fit_transform() method is a two-in-one operation that handles the entire conversion process.
- First, it scans the data to learn the unique categories: 'complaint', 'feedback', and 'support'.
- Then, it creates a new array where each row corresponds to an original label, placing a 1 in the column for the matching category.
Setting sparse_output=False (named sparse before scikit-learn 1.2) ensures the output is a standard array that's easy to inspect, rather than a memory-efficient sparse matrix.
One-hot encoding for movie recommendation systems
Recommendation engines often use one-hot encoding to turn user preferences, like liked movie genres, into numerical vectors for calculating similarity.
import pandas as pd
import numpy as np
# User movie preferences (1=liked, 0=not watched)
users = pd.DataFrame({
    'action': [1, 0, 1],
    'comedy': [0, 1, 1],
    'drama': [1, 1, 0]
}, index=['User1', 'User2', 'User3'])
# Calculate similarity between User1 and other users
similarity = np.dot(users, users.loc['User1']) / (
    np.linalg.norm(users, axis=1) * np.linalg.norm(users.loc['User1'])
)
print(pd.Series(similarity, index=users.index))
This code measures how similar each user's movie taste is to User1's. The user preferences are stored as numerical vectors in a pandas DataFrame, making them ready for comparison. The core of the logic is a cosine similarity calculation.
- It uses np.dot() to compute the dot product between User1's vector and every other user's.
- This value is then normalized by the product of the vector magnitudes, found with np.linalg.norm().
The final output ranks each user's similarity to User1, with a score of 1.0 indicating identical taste.
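If scikit-learn is available, cosine_similarity() performs the same calculation for all user pairs at once, with no manual dot-product or norm bookkeeping. A sketch:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

users = pd.DataFrame({
    'action': [1, 0, 1],
    'comedy': [0, 1, 1],
    'drama': [1, 1, 0]
}, index=['User1', 'User2', 'User3'])

# Row i, column j holds the cosine similarity between user i and user j
scores = cosine_similarity(users)
print(pd.DataFrame(scores, index=users.index, columns=users.index))
```

The diagonal is 1.0 (every user matches themselves perfectly), and each row reproduces the per-user comparison from the manual calculation above.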
Get started with Replit
Now, turn what you’ve learned into a real tool. Tell Replit Agent what you want to build, like “a data cleaning utility using pandas.get_dummies()” or “an API that one-hot encodes JSON data.”
The agent writes the code, tests for errors, and deploys your app directly from your browser. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.

