How to use XGBoost in Python
Master XGBoost in Python. This guide shows you different methods, tips, real-world applications, and how to debug common errors.

XGBoost is a popular machine learning library known for its speed and performance. It provides a powerful framework for gradient boosting, a technique that builds predictive models with high accuracy.
In this article, you’ll learn essential techniques and tips to use XGBoost effectively. You'll find real-world applications and practical advice to help you debug your models and improve performance.
Basic classification with XGBoost
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
```

Output:

```
Accuracy: 0.9333
```
This example runs through a typical machine learning workflow using the classic Iris dataset. After splitting the data, the process focuses on three core XGBoost actions:
- First, you create an instance of the model with `xgb.XGBClassifier()`.
- Next, you train the model on your training data by calling the `.fit()` method.
- Finally, you use `.score()` to check the model's accuracy on the test set, which it has never seen before. This step is crucial for confirming its ability to generalize.
Core XGBoost functionality
While the classifier provides a great starting point, XGBoost's core functionality also includes powerful tools for regression, model tuning, and result interpretation.
Using XGBRegressor for regression tasks
```python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
regressor = xgb.XGBRegressor()
regressor.fit(X_train, y_train)
print(f"R² Score: {regressor.score(X_test, y_test):.4f}")
```

Output:

```
R² Score: 0.8214
```
For regression tasks, where you predict a continuous value like a price, you'll use XGBRegressor. The workflow mirrors the classification example—you still split your data and use the .fit() method to train the model.
- The key difference is creating the model with `xgb.XGBRegressor()`.
- When you call `.score()`, it returns the R² score. This metric is standard for regression and shows how well the model's predictions approximate the real data points.
Tuning hyperparameters with GridSearchCV
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
params = {'max_depth': [3, 5, 7], 'learning_rate': [0.1, 0.01]}
model = GridSearchCV(xgb.XGBClassifier(), params, cv=3)
model.fit(X, y)
print(f"Best parameters: {model.best_params_}")
print(f"Best score: {model.best_score_:.4f}")
```

Output:

```
Best parameters: {'learning_rate': 0.1, 'max_depth': 3}
Best score: 0.9600
```
Finding the right settings, or hyperparameters, is key to your model's performance. Instead of guessing, you can use scikit-learn's GridSearchCV to automate the search. You define a params dictionary with the settings you want to test, like max_depth and learning_rate.
- `GridSearchCV` systematically tries every combination of your specified parameters.
- It uses cross-validation (`cv=3`) to get a reliable performance score for each combination.
After fitting, model.best_params_ reveals the optimal settings, helping you build a more accurate model without the manual guesswork.
Visualizing feature importance
```python
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = xgb.XGBClassifier()
model.fit(X, y)
plt.barh(range(X.shape[1]), model.feature_importances_)
plt.yticks(range(X.shape[1]), [f'Feature {i}' for i in range(X.shape[1])])
plt.xlabel('Importance')
plt.show()
```

Output: a horizontal bar chart showing the relative importance of the four iris features.
Understanding which features your model values most is crucial for interpretation. After training, you can access the model.feature_importances_ attribute. This contains a score for each feature, indicating its contribution to the model's predictions.
- The code uses `matplotlib` to plot these scores in a simple bar chart.
- This visualization quickly reveals which features are most influential, helping you understand what drives the model's decisions and debug its behavior.
Advanced XGBoost techniques
With the fundamentals covered, you can now use advanced techniques to optimize performance, prevent overfitting, and handle more complex data.
Implementing early stopping to prevent overfitting
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
eval_set = [(X_train, y_train), (X_test, y_test)]
# In XGBoost 2.x, eval_metric and early_stopping_rounds are constructor arguments
model = xgb.XGBClassifier(n_estimators=100, eval_metric="mlogloss",
                          early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=eval_set, verbose=False)
print(f"Best iteration: {model.best_iteration}")
```

Output:

```
Best iteration: 25
```
Overfitting happens when a model learns the training data too well, hurting its performance on new data. Early stopping is a practical way to prevent this by monitoring the model's performance on a separate validation set during training.
- You pass the validation data to the `.fit()` method using the `eval_set` parameter.
- Setting `early_stopping_rounds=10` tells the model to stop if the validation score doesn't improve for 10 straight rounds. In XGBoost 2.x this argument belongs on the constructor rather than on `.fit()`.
This process finds the optimal number of training rounds—in this case, the best_iteration was 25—and avoids unnecessary training that could lead to overfitting.
Using native DMatrix for improved performance
```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'multi:softmax', 'num_class': 3}
model = xgb.train(params, dtrain, num_boost_round=10, evals=[(dtest, 'test')])
```

Output:

```
[0] test-merror:0.033333
[1] test-merror:0.033333
[2] test-merror:0.033333
...
```
For a performance boost, you can use XGBoost's native DMatrix data structure. It's optimized for memory and speed, which is especially useful for large datasets. You simply wrap your data and labels into a DMatrix object before training.
- Instead of `.fit()`, you'll use the core `xgb.train()` function.
- Model settings, like the `objective`, are passed in a `params` dictionary.
- The function takes your `DMatrix` objects directly for training and evaluation.
This approach gives you more direct access to XGBoost's core API and can be more efficient than the scikit-learn wrapper.
Handling categorical features with OneHotEncoder
```python
import xgboost as xgb
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})
y = [0, 1, 0, 1, 0]
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), ['category'])],
                                 remainder='passthrough')
pipeline = Pipeline([('preprocessor', preprocessor), ('model', xgb.XGBClassifier())])
pipeline.fit(data, y)
print("Pipeline trained successfully")
```

Output:

```
Pipeline trained successfully
```
XGBoost works with numerical data, so you need a way to handle text-based categorical features. This example uses a scikit-learn Pipeline to streamline the process. It combines preprocessing and model training into a single, clean workflow.
- First, `OneHotEncoder` converts the non-numeric `'category'` column into a numerical format.
- `ColumnTransformer` ensures this transformation is applied only to the correct column.
- Finally, the pipeline automatically feeds the processed data into the `XGBClassifier` for training, making your code more organized and reusable.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. You can take the concepts from this article and use Replit Agent to turn them into production-ready tools—complete with databases, APIs, and deployment.
For the XGBoost techniques we've explored, Replit Agent can build practical applications like these:
- Build a home price prediction tool that uses `XGBRegressor` to estimate property values based on features like location and size.
- Create a customer churn dashboard that uses `XGBClassifier` to identify at-risk users and visualizes the most important factors driving their decisions.
- Deploy a product recommendation API that processes categorical data with a scikit-learn `Pipeline` and predicts user preferences.
Describe your app idea, and Replit Agent will write the code, test it, and fix issues automatically, all in your browser.
Common errors and challenges
Even powerful tools like XGBoost have common pitfalls; here’s how you can navigate a few frequent errors and challenges you might encounter.
You'll often see data type errors when your dataset contains non-numeric values like strings. XGBoost's algorithms require numerical inputs, so you must convert categorical features before training. You can use tools from scikit-learn like OneHotEncoder to transform categories into a numerical format the model can understand.
Another common issue is getting nan (Not a Number) values in your predictions. This usually happens when your input data contains missing values. XGBoost can't process them, so you need to handle them beforehand by either removing the rows with missing data or filling them in—a technique called imputation—using a placeholder like the column's mean or median.
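For the imputation route, scikit-learn's `SimpleImputer` fills gaps with a chosen statistic. A minimal sketch using the median strategy on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with gaps; median imputation is one common choice
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])
imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)
print(X_filled)  # nan in column 0 becomes 5.0, in column 1 becomes 4.0
```

Fit the imputer on training data only, then reuse the same fitted imputer on new data so both are filled consistently.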
Overfitting is a constant concern, but you have several ways to troubleshoot it beyond early stopping. Regularization is a key technique, and you can control it with XGBoost’s alpha and lambda parameters to penalize model complexity. You can also try these adjustments:
- Decrease the `max_depth` parameter to make individual trees less complex and less likely to memorize the training data.
- Use the `subsample` and `colsample_bytree` parameters to train on a random subset of data rows and features for each tree, which helps the model generalize better.
Fixing data type errors with XGBClassifier
One of the first errors you might see comes from data types. If you pass a dataset with string values directly to an XGBClassifier, it will fail because it only understands numbers. The code below shows exactly what this error looks like.
```python
import xgboost as xgb
import pandas as pd

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['low', 'medium', 'high', 'medium', 'low']
})
target = [0, 1, 0, 1, 0]
model = xgb.XGBClassifier()
model.fit(data, target)
```
The fit method fails because the 'feature2' column contains string values like 'low' and 'medium'. XGBoost can't handle this non-numeric data directly. The following code shows how to prepare the data to resolve this error.
```python
import xgboost as xgb
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['low', 'medium', 'high', 'medium', 'low']
})
target = [0, 1, 0, 1, 0]
le = LabelEncoder()
data['feature2'] = le.fit_transform(data['feature2'])
model = xgb.XGBClassifier()
model.fit(data, target)
```
The fix is to convert the string values into numbers before training, a common step since real-world data often contains text.
- The code uses scikit-learn's `LabelEncoder` to assign a unique integer to each category in alphabetical order, turning `'high'` into 0, `'low'` into 1, and `'medium'` into 2.
- The `fit_transform` method handles this conversion in one step, making the data ready for the model.
You'll want to watch for this error whenever your dataset includes descriptive text fields.
Resolving nan value errors in predictions
Seeing nan values or unreliable numbers in your model's output is a common issue. It usually means the data you're feeding the model contains missing entries it wasn't told how to interpret. Recent XGBoost versions treat np.nan as missing by default, but the behavior is worth making explicit, especially if your dataset marks gaps with a different sentinel value.
The code below shows what happens when you try to predict on data containing np.nan values.
```python
import xgboost as xgb
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = [0, 1, 0, 1]
model = xgb.XGBClassifier()
model.fit(X, y)
X_new = np.array([[np.nan, 2], [3, np.nan]])
predictions = model.predict(X_new)
print(predictions)
```
Depending on your XGBoost version and configuration, the .predict() call can fail or produce untrustworthy results because X_new contains np.nan entries the model wasn't explicitly configured to interpret. The code below makes the handling of missing values explicit.
```python
import xgboost as xgb
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = [0, 1, 0, 1]
model = xgb.XGBClassifier(missing=np.nan)
model.fit(X, y)
X_new = np.array([[np.nan, 2], [3, np.nan]])
predictions = model.predict(X_new)
print(predictions)
```
The fix is to tell XGBoost explicitly what to treat as a missing value. Setting missing=np.nan when creating the model (also the default in recent versions) instructs it to route np.nan entries down each tree's learned default branch, so prediction still produces valid outputs.
- This built-in feature saves you from manually cleaning the data beforehand.
- You'll want to use this whenever your new data might have gaps, which is common in real-world datasets where information isn't always complete.
Troubleshooting overfitting in XGBoost models
Overfitting is a classic challenge where your model learns the training data too well, including its noise. You'll see high training accuracy but poor performance on unseen data, since the model can't generalize. The code below demonstrates this exact problem.
```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier(n_estimators=1000, max_depth=15)
model.fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")
```
The high values for n_estimators=1000 and max_depth=15 create an overly complex model that memorizes the training data instead of learning to generalize. The next example shows how to adjust these settings to resolve the issue.
```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    reg_alpha=1,
    reg_lambda=1
)
model.fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")
```
The fix involves tuning several hyperparameters to make the model less complex and prevent it from memorizing the training data. This helps close the gap between training and test accuracy, creating a more reliable model.
- Reducing `n_estimators` and `max_depth` limits the number and depth of the trees.
- Adding regularization with `reg_alpha` and `reg_lambda` penalizes complexity, forcing the model to generalize better.
You'll want to watch for overfitting whenever your training accuracy is much higher than your test accuracy.
Real-world applications
Now that you can navigate common challenges, you can apply these techniques to build practical models for sentiment analysis and credit risk.
Sentiment analysis of movie reviews with XGBoost
By converting text from movie reviews into a numerical format, you can train an XGBClassifier to predict whether the sentiment is positive or negative.
```python
import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Sample movie reviews and their sentiment (1 = positive, 0 = negative)
reviews = [
    "This movie was fantastic and I loved it",
    "Worst film I've ever seen, terrible acting",
    "Great plot, excellent direction, amazing performances",
    "Boring story, bad effects, would not recommend"
]
sentiments = [1, 0, 1, 0]

# Convert text to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(X, sentiments, test_size=0.5)

# Train XGBoost classifier
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(f"Sentiment prediction accuracy: {model.score(X_test, y_test):.4f}")
```
This example tackles text classification by first preparing the data. Since XGBoost requires numerical input, you'll use scikit-learn's CountVectorizer to handle the raw text.
- The `CountVectorizer` analyzes the vocabulary in the `reviews` and converts each one into a vector of word counts.
- After this transformation, the data is split for training and testing, and an `XGBClassifier` is trained on the resulting numerical data to make its predictions.
Creating a credit risk model with feature_importances_
You can use XGBoost's feature_importances_ attribute to build a credit risk model that predicts defaults while also identifying which factors, like income or credit score, are the most important.
```python
import xgboost as xgb
import numpy as np

# Simulated credit data (age, income, credit_score, loan_amount)
X = np.array([
    [25, 50000, 680, 10000],
    [35, 70000, 720, 15000],
    [45, 55000, 600, 20000],
    [30, 60000, 650, 25000],
    [50, 80000, 750, 30000],
    [27, 45000, 630, 12000],
    [40, 90000, 790, 40000],
    [33, 65000, 710, 22000]
])
# Default status (1 = defaulted, 0 = paid)
y = np.array([0, 0, 1, 1, 0, 1, 0, 0])
feature_names = ['age', 'income', 'credit_score', 'loan_amount']

# Train model
model = xgb.XGBClassifier()
model.fit(X, y)

# Get most important feature for predicting credit risk
top_feature = feature_names[np.argmax(model.feature_importances_)]
print(f"Most important feature for credit risk: {top_feature}")
print(f"Importance score: {max(model.feature_importances_):.4f}")
```
This example shows how to interpret an XGBoost model's logic using simulated credit data. After the model learns to predict loan defaults, you can investigate which data points it found most useful for making its decisions.
- The `model.feature_importances_` attribute contains a score for each feature, reflecting its predictive power.
- By using `np.argmax()`, the code identifies the feature with the highest score.
This technique helps you understand what drives the model's output, revealing which factor—like income or loan amount—it relied on most heavily.
Get started with Replit
Turn your knowledge into a tool with Replit Agent. Try prompts like "build a sentiment analysis API for reviews" or "create a dashboard to visualize credit risk factors."
The agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.