How to use XGBoost in Python
Master XGBoost in Python. This guide shows you different methods, tips, real-world applications, and how to debug common errors.

XGBoost is a popular machine learning library known for its speed and performance. It provides a powerful framework for gradient boosting, a technique that builds predictive models with high accuracy.
In this article, you’ll learn essential techniques and tips to use XGBoost effectively. You'll find real-world applications and practical advice to help you debug your models and improve performance.
Basic classification with XGBoost
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
```

Output:

```
Accuracy: 0.9333
```
This example runs through a typical machine learning workflow using the classic Iris dataset. After splitting the data, the process focuses on three core XGBoost actions:
- First, you create an instance of the model with `xgb.XGBClassifier()`.
- Next, you train the model on your training data by calling the `.fit()` method.
- Finally, you use `.score()` to check the model's accuracy on the test set, which it has never seen before. This step is crucial for confirming its ability to generalize.
Core XGBoost functionality
While the classifier provides a great starting point, XGBoost's core functionality also includes powerful tools for regression, model tuning, and result interpretation.
Using XGBRegressor for regression tasks
```python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
regressor = xgb.XGBRegressor()
regressor.fit(X_train, y_train)
print(f"R² Score: {regressor.score(X_test, y_test):.4f}")
```

Output:

```
R² Score: 0.8214
```
For regression tasks, where you predict a continuous value like a price, you'll use XGBRegressor. The workflow mirrors the classification example—you still split your data and use the .fit() method to train the model.
- The key difference is creating the model with `xgb.XGBRegressor()`.
- When you call `.score()`, it returns the R² score. This metric is standard for regression and shows how well the model's predictions approximate the real data points.
Tuning hyperparameters with GridSearchCV
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
params = {'max_depth': [3, 5, 7], 'learning_rate': [0.1, 0.01]}
model = GridSearchCV(xgb.XGBClassifier(), params, cv=3)
model.fit(X, y)
print(f"Best parameters: {model.best_params_}")
print(f"Best score: {model.best_score_:.4f}")
```

Output:

```
Best parameters: {'learning_rate': 0.1, 'max_depth': 3}
Best score: 0.9600
```
Finding the right settings, or hyperparameters, is key to your model's performance. Instead of guessing, you can use scikit-learn's GridSearchCV to automate the search. You define a params dictionary with the settings you want to test, like max_depth and learning_rate.
- `GridSearchCV` systematically tries every combination of your specified parameters.
- It uses cross-validation (`cv=3`) to get a reliable performance score for each combination.
After fitting, model.best_params_ reveals the optimal settings, helping you build a more accurate model without the manual guesswork.
Visualizing feature importance
```python
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = xgb.XGBClassifier()
model.fit(X, y)
plt.barh(range(X.shape[1]), model.feature_importances_)
plt.yticks(range(X.shape[1]), [f'Feature {i}' for i in range(X.shape[1])])
plt.xlabel('Importance')
plt.show()
```

Output: a horizontal bar chart showing the relative importance of the four iris features.
Understanding which features your model values most is crucial for interpretation. After training, you can access the model.feature_importances_ attribute. This contains a score for each feature, indicating its contribution to the model's predictions.
- The code uses `matplotlib` to plot these scores in a simple bar chart.
- This visualization quickly reveals which features are most influential, helping you understand what drives the model's decisions and debug its behavior.
Advanced XGBoost techniques
With the fundamentals covered, you can now use advanced techniques to optimize performance, prevent overfitting, and handle more complex data.
Implementing early stopping to prevent overfitting
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
eval_set = [(X_train, y_train), (X_test, y_test)]
# In XGBoost 2.x, eval_metric and early_stopping_rounds are constructor arguments
model = xgb.XGBClassifier(n_estimators=100, eval_metric="mlogloss",
                          early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=eval_set, verbose=False)
print(f"Best iteration: {model.best_iteration}")
```

Output:

```
Best iteration: 25
```
Overfitting happens when a model learns the training data too well, hurting its performance on new data. Early stopping is a practical way to prevent this by monitoring the model's performance on a separate validation set during training.
- You pass the validation data to the `.fit()` method using the `eval_set` parameter.
- Setting `early_stopping_rounds=10` tells the model to stop if the validation score doesn't improve for 10 straight rounds. In XGBoost 2.x this argument belongs on the constructor rather than on `.fit()`.
This process finds the optimal number of training rounds—in this case, the best_iteration was 25—and avoids unnecessary training that could lead to overfitting.
Using native DMatrix for improved performance
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'multi:softmax', 'num_class': 3}
model = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtest, 'test')])
```

Output:

```
[0] test-merror:0.033333
[1] test-merror:0.033333
[2] test-merror:0.033333
...
```
For a performance boost, you can use XGBoost's native DMatrix data structure. It's optimized for memory and speed, which is especially useful for large datasets. You simply wrap your data and labels into a DMatrix object before training.
- Instead of `.fit()`, you'll use the core `xgb.train()` function.
- Model settings, like the `objective`, are passed in a `params` dictionary.
- The function takes your `DMatrix` objects directly for training and evaluation.
This approach gives you more direct access to XGBoost's core API and can be more efficient than the scikit-learn wrapper.
Handling categorical features with OneHotEncoder
```python
import xgboost as xgb
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})
y = [0, 1, 0, 1, 0]
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), ['category'])],
                                 remainder='passthrough')
pipeline = Pipeline([('preprocessor', preprocessor), ('model', xgb.XGBClassifier())])
pipeline.fit(data, y)
print("Pipeline trained successfully")
```

Output:

```
Pipeline trained successfully
```
XGBoost works with numerical data, so you need a way to handle text-based categorical features. This example uses a scikit-learn Pipeline to streamline the process. It combines preprocessing and model training into a single, clean workflow.
- First, `OneHotEncoder` converts the non-numeric `'category'` column into a numerical format.
- `ColumnTransformer` ensures this transformation is applied only to the correct column.
- Finally, the pipeline automatically feeds the processed data into the `XGBClassifier` for training, making your code more organized and reusable.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. You can take the concepts from this article and use Replit Agent to turn them into production-ready tools—complete with databases, APIs, and deployment.
For the XGBoost techniques we've explored, Replit Agent can build practical applications like these:
- Build a home price prediction tool that uses `XGBRegressor` to estimate property values based on features like location and size.
- Create a customer churn dashboard that uses `XGBClassifier` to identify at-risk users and visualizes the most important factors driving their decisions.
- Deploy a product recommendation API that processes categorical data with a scikit-learn `Pipeline` and predicts user preferences.
Describe your app idea, and Replit Agent will write the code, test it, and fix issues automatically, all in your browser.
Common errors and challenges
Even powerful tools like XGBoost have common pitfalls; here’s how you can navigate a few frequent errors and challenges you might encounter.
You'll often see data type errors when your dataset contains non-numeric values like strings. XGBoost's algorithms require numerical inputs, so you must convert categorical features before training. You can use tools from scikit-learn like OneHotEncoder to transform categories into a numerical format the model can understand.
Another common issue is getting nan (Not a Number) values in your predictions. This usually happens when your input data contains missing values. XGBoost can't process them, so you need to handle them beforehand by either removing the rows with missing data or filling them in—a technique called imputation—using a placeholder like the column's mean or median.
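For the imputation route, scikit-learn's `SimpleImputer` fills gaps with a chosen statistic. A minimal sketch using the median strategy on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with gaps; median imputation is one common choice
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])
imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)
print(X_filled)  # nan in column 0 becomes 5.0, in column 1 becomes 4.0
```

Fit the imputer on training data only, then reuse the same fitted imputer on new data so both are filled consistently.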
Overfitting is a constant concern, but you have several ways to troubleshoot it beyond early stopping. Regularization is a key technique, and you can control it with XGBoost’s alpha and lambda parameters to penalize model complexity. You can also try these adjustments:
- Decrease the `max_depth` parameter to make individual trees less complex and less likely to memorize the training data.
- Use the `subsample` and `colsample_bytree` parameters to train on a random subset of data rows and features for each tree, which helps the model generalize better.
Fixing data type errors with XGBClassifier
One of the first errors you might see comes from data types. If you pass a dataset with string values directly to an XGBClassifier, it will fail because it only understands numbers. The code below shows exactly what this error looks like.
```python
import xgboost as xgb
import pandas as pd

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['low', 'medium', 'high', 'medium', 'low']
})
target = [0, 1, 0, 1, 0]
model = xgb.XGBClassifier()
model.fit(data, target)
```
The fit method fails because the 'feature2' column contains string values like 'low' and 'medium'. XGBoost can't handle this non-numeric data directly. The following code shows how to prepare the data to resolve this error.
```python
import xgboost as xgb
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['low', 'medium', 'high', 'medium', 'low']
})
target = [0, 1, 0, 1, 0]
le = LabelEncoder()
data['feature2'] = le.fit_transform(data['feature2'])
model = xgb.XGBClassifier()
model.fit(data, target)
```
The fix is to convert the string values into numbers before training, a common step since real-world data often contains text.
- The code uses scikit-learn's `LabelEncoder` to assign a unique integer to each category in alphabetical order, turning `'high'` into 0, `'low'` into 1, and `'medium'` into 2.
- The `fit_transform` method handles this conversion in one step, making the data ready for the model.
You'll want to watch for this error whenever your dataset includes descriptive text fields.
Resolving nan value errors in predictions
Seeing nan values or unreliable numbers in your model's output is a common issue. It usually means the data you're feeding the model contains missing entries it wasn't told how to interpret. Recent XGBoost versions treat np.nan as missing by default, but the behavior is worth making explicit, especially if your dataset marks gaps with a different sentinel value.
The code below shows what happens when you try to predict on data containing np.nan values.
```python
import xgboost as xgb
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = [0, 1, 0, 1]
model = xgb.XGBClassifier()
model.fit(X, y)
X_new = np.array([[np.nan, 2], [3, np.nan]])
predictions = model.predict(X_new)
print(predictions)
```
Depending on your XGBoost version and configuration, the .predict() call can fail or produce untrustworthy results because X_new contains np.nan entries the model wasn't explicitly configured to interpret. The code below makes the handling of missing values explicit.
```python
import xgboost as xgb
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = [0, 1, 0, 1]
model = xgb.XGBClassifier(missing=np.nan)
model.fit(X, y)
X_new = np.array([[np.nan, 2], [3, np.nan]])
predictions = model.predict(X_new)
print(predictions)
```
The fix is to tell XGBoost explicitly what to treat as a missing value. Setting missing=np.nan when creating the model (also the default in recent versions) instructs it to route np.nan entries down each tree's learned default branch, so prediction still produces valid outputs.
- This built-in feature saves you from manually cleaning the data beforehand.
- You'll want to use this whenever your new data might have gaps, which is common in real-world datasets where information isn't always complete.
Troubleshooting overfitting in XGBoost models
Overfitting is a classic challenge where your model learns the training data too well, including its noise. You'll see high training accuracy but poor performance on unseen data, since the model can't generalize. The code below demonstrates this exact problem.
```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier(n_estimators=1000, max_depth=15)
model.fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")
```
The high values for n_estimators=1000 and max_depth=15 create an overly complex model that memorizes the training data instead of learning to generalize. The next example shows how to adjust these settings to resolve the issue.
```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    reg_alpha=1,
    reg_lambda=1
)
model.fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")
```
The fix involves tuning several hyperparameters to make the model less complex and prevent it from memorizing the training data. This helps close the gap between training and test accuracy, creating a more reliable model.
- Reducing `n_estimators` and `max_depth` limits the number and depth of the trees.
- Adding regularization with `reg_alpha` and `reg_lambda` penalizes complexity, forcing the model to generalize better.
You'll want to watch for overfitting whenever your training accuracy is much higher than your test accuracy.
Real-world applications
Now that you can navigate common challenges, you can apply these techniques to build practical models for sentiment analysis and credit risk.
Sentiment analysis of movie reviews with XGBoost
By converting text from movie reviews into a numerical format, you can train an XGBClassifier to predict whether the sentiment is positive or negative.
```python
import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Sample movie reviews and their sentiment (1 = positive, 0 = negative)
reviews = [
    "This movie was fantastic and I loved it",
    "Worst film I've ever seen, terrible acting",
    "Great plot, excellent direction, amazing performances",
    "Boring story, bad effects, would not recommend"
]
sentiments = [1, 0, 1, 0]

# Convert text to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(X, sentiments, test_size=0.5)

# Train XGBoost classifier
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(f"Sentiment prediction accuracy: {model.score(X_test, y_test):.4f}")
```
This example tackles text classification by first preparing the data. Since XGBoost requires numerical input, you'll use scikit-learn's CountVectorizer to handle the raw text.
- The `CountVectorizer` analyzes the vocabulary in the `reviews` and converts each one into a vector of word counts.
- After this transformation, the data is split for training and testing, and an `XGBClassifier` is trained on the resulting numerical data to make its predictions.
Creating a credit risk model with feature_importances_
You can use XGBoost's feature_importances_ attribute to build a credit risk model that predicts defaults while also identifying which factors, like income or credit score, are the most important.
```python
import xgboost as xgb
import numpy as np

# Simulated credit data (age, income, credit_score, loan_amount)
X = np.array([
    [25, 50000, 680, 10000],
    [35, 70000, 720, 15000],
    [45, 55000, 600, 20000],
    [30, 60000, 650, 25000],
    [50, 80000, 750, 30000],
    [27, 45000, 630, 12000],
    [40, 90000, 790, 40000],
    [33, 65000, 710, 22000]
])
# Default status (1 = defaulted, 0 = paid)
y = np.array([0, 0, 1, 1, 0, 1, 0, 0])
feature_names = ['age', 'income', 'credit_score', 'loan_amount']

# Train model
model = xgb.XGBClassifier()
model.fit(X, y)

# Get most important feature for predicting credit risk
top_feature = feature_names[np.argmax(model.feature_importances_)]
print(f"Most important feature for credit risk: {top_feature}")
print(f"Importance score: {max(model.feature_importances_):.4f}")
```
This example shows how to interpret an XGBoost model's logic using simulated credit data. After the model learns to predict loan defaults, you can investigate which data points it found most useful for making its decisions.
- The `model.feature_importances_` attribute contains a score for each feature, reflecting its predictive power.
- By using `np.argmax()`, the code identifies the feature with the highest score.
This technique helps you understand what drives the model's output, revealing which factor—like income or loan amount—it relied on most heavily.
Get started with Replit
Turn your knowledge into a tool with Replit Agent. Try prompts like "build a sentiment analysis API for reviews" or "create a dashboard to visualize credit risk factors."
The agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.