Gradient boosting classifiers in Scikit-Learn and Caret
Gradient boosting is a powerful and widely used machine learning algorithm in data science. It's part of a family of ensemble learning methods, along with bagging, that combine the predictions of multiple simpler models to improve overall performance. Gradient boosting regression uses gradient boosting to predict continuous output values, while a gradient boosting classifier, which you'll explore in this tutorial, uses gradient boosting to classify input data as belonging to two or more different classes. Gradient boosting generalizes the earlier AdaBoost algorithm, which uses decision stumps rather than full trees. A decision stump is similar to a tree in a random forest, but it has only one node and two leaves.
The gradient boosting algorithm builds models sequentially; each step tries to correct the mistakes of the previous iteration. The training process begins by fitting a weak learner, such as a shallow decision tree, to the training data. After that initial training, gradient boosting computes the errors between the actual and predicted values (often called residuals) and then trains a new estimator to predict those errors. That new tree is added to the ensemble, updating the predictions and gradually building a strong learner. Gradient boosting repeats this process until improvement stops or until a fixed number of iterations has been reached. Boosting itself is similar to gradient descent, but it "descends" the gradient by introducing new models rather than by adjusting a single model's parameters.
Boosting has several advantages: it performs well on tabular data, it can handle both numerical and categorical features, and, with an appropriate loss function, it can be reasonably robust to outliers in the dataset. It often works well even with default parameters. However, it can be slow to train and is sensitive to the hyperparameters chosen for the training process. Keeping the individual trees small can speed up training on a large dataset; tree size is usually controlled through the max depth parameter, while the number of trees is set separately (for example, n_estimators in scikit-learn or n.trees in gbm). Gradient boosting can also be prone to overfitting if not tuned properly. To help prevent overfitting, you can lower the learning rate, which shrinks the contribution of each new tree. This process is roughly the same for a classifier or a gradient boosting regressor, and it underlies the popular XGBoost library, which builds on gradient boosting by adding regularization.
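To make the sequential, residual-fitting idea concrete, here is a minimal sketch in Python for a regression target, using scikit-learn decision trees as the weak learners. This is only an illustration of the idea, not how the libraries used later in this tutorial implement it; the variable names and parameter values are made up for the example.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
# Toy regression data; the names and values below are illustrative only.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
n_rounds = 50        # number of boosting iterations (trees in the ensemble)
learning_rate = 0.1  # shrinks each tree's contribution to reduce overfitting
prediction = np.full(len(y), y.mean())  # start from a constant (weak) model
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)  # shallow tree = weak learner
    tree.fit(X, residuals)                     # fit the new tree to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
# To score new data, add the constant starting prediction plus each tree's
# learning-rate-scaled output.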
In this tutorial, you'll learn how to use two different programming languages and gradient boosting libraries to classify penguins by using the popular Palmer Penguins dataset.
Step 1 Create a Notebook using R
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai by using your IBM Cloud® account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
Make sure to select "Runtime 24.1 on R 4.3 S (4 vCPU 16 GB RAM)" when you create the notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook can be found on GitHub.
Step 3 Configure Libraries and Data
In R, the caret library is a powerful tool for general data preparation and model fitting; you'll use it both to prepare the data and to train the model itself.
install.packages('gbm')            # gradient boosting machine implementation
install.packages('caret')          # data preparation and model training utilities
install.packages('palmerpenguins') # the Palmer Penguins dataset
library(gbm)
library(caret)
library(palmerpenguins)
head(penguins) # head() returns the top 6 rows of the dataframe
summary(penguins) # prints a statistical summary of the data columns
Use the createDataPartition function from the caret package to split the original dataset into a training set (70%) and a testing set (30%).
dim(penguins)
# get rid of any NA
penguins <- na.omit(penguins)
# createDataPartition returns stratified row indices for the 70% training split
parts = caret::createDataPartition(penguins$species, p = 0.7, list = F)
train = penguins[parts, ]
test = penguins[-parts, ]
Now you're ready to train and test.
Step 4 Train and Test
The train method from the caret library uses R formulas, where the dependent variable (often called the target) is on the left-hand side of a tilde '~' and the independent variables (often called features) are on the right-hand side of the '~'. For instance:
height ~ age
This would predict height based on age.
To caret's train function, you pass the formula, the training data, and the method to use. The caret library provides methods for many different types of training, so setting the method to "gbm" is where you specify gradient boosting. The trControl parameter configures the training process. The "repeatedcv" method performs repeated k-fold cross-validation on subsamples of the training data. Here, you specify 3 repeats of 5-fold cross-validation, using a different set of folds for each repeat.
model_gbm <- caret::train(species ~ .,   # predict species from all other columns
                          data = train,
                          method = "gbm", # gbm for gradient boosting machine
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 5,
                                                   repeats = 3,
                                                   verboseIter = FALSE),
                          verbose = 0)
Now you can use the predictive model to make predictions on test data:
pred_test = caret::confusionMatrix(
data = predict(model_gbm, test),
reference = test$species
)
print(pred_test)
This will print:
Confusion Matrix and Statistics
Reference
Prediction Adelie Chinstrap Gentoo
Adelie 42 0 0
Chinstrap 0 20 0
Gentoo 1 0 35
Overall Statistics
Accuracy : 0.9898
95% CI : (0.9445, 0.9997)
No Information Rate : 0.4388
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.984
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Adelie Class: Chinstrap Class: Gentoo
Sensitivity 0.9767 1.0000 1.0000
Specificity 1.0000 1.0000 0.9841
Pos Pred Value 1.0000 1.0000 0.9722
Neg Pred Value 0.9821 1.0000 1.0000
Prevalence 0.4388 0.2041 0.3571
Detection Rate 0.4286 0.2041 0.3571
Detection Prevalence 0.4286 0.2041 0.3673
Balanced Accuracy 0.9884 1.0000 0.9921
Because the data split and the cross-validation folds are chosen randomly, the sensitivity and specificity for each class, as well as the overall accuracy, may be slightly different from what is shown here. The accuracy is quite good, even for the Chinstrap penguin, which makes up only about 20% of the training dataset.
Step 5 Create a Notebook in Python
Now you'll learn how to create a gradient boosting model in Python. In the same project that you created previously, create another Jupyter Notebook in Watson Studio, this time using Python 3.11. Make sure to select "Runtime 24.1 on Python 3.11 XXS (1 vCPU 4 GB RAM)" when you create the notebook. You're now ready to create a gradient boosting classifier using Python.
Step 6 Configure Libraries and Data
This step installs the libraries that you'll use to train and test your gradient boosting classifier. The training itself is done with scikit-learn, and the data comes from the palmerpenguins library.
!pip install seaborn pandas scikit-learn palmerpenguins
Now import the libraries:
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from palmerpenguins import load_penguins
As in the R code, there are some NAs in the penguins dataset that need to be removed. This code snippet loads the dataset, removes any NA rows, and then splits the data into features and target.
# Load the penguins dataset into a pandas DataFrame
penguins = load_penguins()
penguins = penguins.dropna()  # drop any rows with missing values
X = penguins.drop("species", axis=1)
y = penguins["species"]
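It can help to confirm which feature columns pandas treats as categorical (object dtype) and which as numerical, because the preprocessing later in this step branches on exactly that distinction. A quick, optional check (the exact dtypes depend on your palmerpenguins version):
# island and sex should show up as object (categorical) columns, while the bill,
# flipper, body mass, and year columns should be numerical (float64 or int64).
print(X.dtypes)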
Now create a training and testing split of the dataset, with 70% of the data pulled for training and 30% reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
Next, you'll gather two lists of column names: one for the categorical features of X and another for the numerical features (for example, float64 or int64). Then, use ColumnTransformer from scikit-learn to apply different preprocessing to different column types. A OneHotEncoder will be applied to the categorical features to convert them into binary vectors, and a StandardScaler will be applied to the numerical features to standardize them to a mean of 0 and a variance of 1.
# Define categorical and numerical features
categorical_features = X.select_dtypes(
include=["object"]
).columns.tolist()
numerical_features = X.select_dtypes(
include=["float64", "int64"]
).columns.tolist()
preprocessor = ColumnTransformer(
transformers=[
("cat", OneHotEncoder(), categorical_features),
("num", StandardScaler(), numerical_features),
]
)
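Optionally, you can fit the preprocessor on the training features by itself to see what it produces; the exact number of output columns depends on how many categories OneHotEncoder finds, and the X_train_prepared name below is just for this check.
# Fit the preprocessor alone and inspect the transformed training features.
# One-hot encoding expands each categorical column into one column per category.
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)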
Step 7 Train and Test
Now that you've created the feature lists and the preprocessor, you can create a pipeline to train the model. Other parameters you can configure include max_features, which sets the number of features to consider when looking for the best split, and criterion, which measures the quality of a split during training. In this case, you're using friedman_mse, the mean squared error with Friedman's improvement score.
pipeline = Pipeline(
[
("preprocessor", preprocessor),
("classifier", GradientBoostingClassifier(random_state=42, criterion='friedman_mse', max_features=2)),
]
)
Next, perform cross-validation to evaluate how well your machine learning pipeline performs on the training data. Then, calling the fit method of the pipeline you created trains the model on the full training set. By default, GradientBoostingClassifier minimizes the log loss; the friedman_mse criterion set earlier only measures the quality of candidate splits within each tree.
# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
# Fit the model on the training data
pipeline.fit(X_train, y_train)
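If you want more detail than the mean alone, you can also print the accuracy from each individual fold; the exact values will vary slightly with the random split.
# One accuracy value per validation fold from the 5-fold cross-validation
print(cv_scores)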
Now that the model has been trained, predict the test set and check the performance:
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Generate classification report
report = classification_report(y_test, y_pred)
Print the results:
print(f"Mean Cross-Validation Accuracy: {cv_scores.mean():.4f}")
print("\nClassification Report:")
print(report)
This will print out the following:
Mean Cross-Validation Accuracy: 0.9775
Classification Report:
precision recall f1-score support
Adelie 1.00 1.00 1.00 31
Chinstrap 1.00 1.00 1.00 18
Gentoo 1.00 1.00 1.00 18
accuracy 1.00 67
macro avg 1.00 1.00 1.00 67
weighted avg 1.00 1.00 1.00 67
This is very close to the accuracy reported by the R model in the first part of this tutorial.
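As noted at the start of this tutorial, gradient boosting can be sensitive to hyperparameters such as the learning rate and the number of trees. If you want to experiment further, one option is to wrap the pipeline from Step 7 in scikit-learn's GridSearchCV; the parameter grid below is only an illustrative starting point, not a recommendation.
from sklearn.model_selection import GridSearchCV
# Parameters of a step inside a Pipeline are addressed as "<step name>__<parameter>".
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__learning_rate": [0.05, 0.1, 0.2],
    "classifier__max_depth": [2, 3],
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.4f}")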