{tidymodels} and XGBoost
By Jesse Mostipak in technical
May 17, 2021
Note: this post was originally published as a Kaggle notebook
Introduction
This notebook is designed to provide an introductory-level approach to building an xgboost model using {tidymodels}. As you go through the notebook you’ll notice that there are a multitude of ways to improve upon this approach, including - but not limited to:
- splitting the data into training and validation sets
- evaluating the model with resampling
- tuning the model parameters
In its current state the model takes ~20 minutes to train. Please feel free to use this code as a baseline as you build out a more robust model for the Kaggle June 2021 Tabular Playground Series, and don’t hesitate to tag me in what you create – I’d love to see what you make!
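If you’d like to try those improvements, here’s a minimal sketch of what the first two bullets could look like using {rsample} (part of {tidymodels}); the seed, split proportion, and fold count are placeholder choices, and train_tidy refers to the tibble we create later in this notebook:
library(tidymodels)
# placeholder seed for reproducibility
set.seed(123)
# hold out a validation set (initial_split defaults to a 3/4 training split)
train_split <- initial_split(train_tidy, strata = target)
train_set <- training(train_split)
valid_set <- testing(train_split)
# 5-fold cross-validation on the training set for resampled evaluation
train_folds <- vfold_cv(train_set, v = 5, strata = target)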
Setup
Installing the {usemodels} package
I love the {usemodels} package! As you’ll see a bit further in the notebook, we can use the {usemodels} package to generate all of the boilerplate code that we need to build our model.
install.packages("usemodels")
Loading all of the packages
library(tidymodels) # metapackage of all the tidymodels packages
library(usemodels) # package that creates all of the boilerplate code for our model <3
library(tidyverse) # metapackage of all tidyverse packages - likely overkill for this project but that's OK
tidymodels_prefer() # "tells" R to use the tidymodels version of a function if there is a conflict
list.files(path = "../input")
Importing the data
train <- read_csv("../input/tabular-playground-series-jun-2021/train.csv")
test <- read_csv("../input/tabular-playground-series-jun-2021/test.csv")
sampsub <- read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")
Inspecting the data
While I did not include any data visualizations in this notebook, I did want to include two of the functions I often use when looking at a new-to-me dataset.
dim() will return the dimensions of our dataset; in this case we’re looking at 200,000 rows and 77 columns. glimpse() will also return the dimensions of our dataset, along with each column name, the data type of each column, and the data for the first few rows. What we see with this particular dataset is that our id and feature_* columns are all numeric (doubles), while our target column is comprised of character data.
dim(train)
glimpse(train)
Wrangling the data
We’re not doing anything too fancy with our data at this point, but I did set up a quick mutate() call to convert our target column from character data to factor data.
The code will actually convert all of the columns that are character data into factor data. In this case it’s only a single column, and we could have accomplished the same thing with mutate(target = as.factor(target)), but I wanted to provide this example in case you ever need it in the future!
train_tidy <- train %>%
mutate(across(where(is.character), as.factor))
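For comparison, the single-column version mentioned above would be:
# equivalent result here, since target is the only character column
train_tidy <- train %>%
  mutate(target = as.factor(target))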
glimpse(train_tidy)
Building the xgboost model
We haven’t done any feature engineering, nor have we explored our data beyond using the dim() and glimpse() functions. For this first pass at modeling we’re simply going to throw all of our features into the model and see how well they predict the target using an xgboost model.
To use the {usemodels} package, we pull the function associated with the model we want to train, in this case xgboost. We also provide the data argument to the function, and when we run the code we see that we get our recipe, spec, workflow, and tune code. We can then copy and paste what we need and alter it.
Getting boilerplate code for xgboost
The {usemodels} package will provide boilerplate code for a host of models that can be used within the {tidymodels} framework. You can read more about the package and how it works in the {usemodels} documentation.
# we're essentially saying:
# return the code for an xgboost model that predicts the target
# based on all of the features within the dataset,
# using the train_tidy tibble as our data source
use_xgboost(target ~ ., data = train_tidy)
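For reference, the printed boilerplate looks roughly like the following; the exact output (including the randomly generated seed and which parameters are marked for tuning) can vary between {usemodels} versions, so treat this as an approximation rather than the canonical output:
xgboost_recipe <-
  recipe(formula = target ~ ., data = train_tidy) %>%
  step_zv(all_predictors())
xgboost_spec <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(),
    learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")
xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)
set.seed(12345) # the seed in the real output is randomly generated
xgboost_tune <-
  tune_grid(xgboost_workflow,
    resamples = stop("add your rsample object"),
    grid = stop("add number of candidate points"))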
Setting up the code for xgboost
There are a couple of things in the code below that look different from the output of our previous line of code, use_xgboost(target ~ ., data = train_tidy).
You’ll notice that in the recipe I’ve added the line update_role(id, new_role = "id"). This is because I want the id column to be “carried along” so that I can access it later, but I do not want it to be included as a feature when I train my model.
In the spec you’ll see that I’ve removed almost all of the tuning variables and set trees = 100. This is a first pass at the model, and at this point I’m not too worried about tuning it, but I do encourage you to dig into the resources provided at the end of this notebook and try your hand at tuning!
Last but not least, I did not include the set.seed() function or the code used to tune the model.
xgboost_recipe <-
  recipe(formula = target ~ ., data = train_tidy) %>%
  # step_zv removes variables that contain only a single value
  step_zv(all_predictors()) %>%
  # carry the id column along without treating it as a predictor
  update_role(id, new_role = "id")
xgboost_spec <-
boost_tree(trees = 100) %>%
set_mode("classification") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
Training the xgboost model
xgb_fit <- xgboost_workflow %>%
  fit(train_tidy) # heads up: this takes ~20 minutes to train
Making predictions with the xgboost model
It’s important to note here that we’ve added type = "prob" to our predict() function. If we don’t provide this, R is going to make its best guess as to how things should be calculated, and to make a long story short, you’ll end up with a single column of predictions with the Class value listed. Instead, what we want to know is the probability that a given id belongs to a given Class.
You can read more about the type argument in the {parsnip} predict() documentation.
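A minimal version of that prediction chunk, using the fitted workflow from above to get class probabilities for every row of the test set, looks like this:
# predict class probabilities (one .pred_* column per class) for the test set
test_predictions <- predict(xgb_fit, new_data = test, type = "prob")
test_predictions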
Formatting the competition submission
You may have noticed that when we printed out test_predictions in the previous code chunk, the column headers were formatted as follows: .pred_Class_*. We’ll want to address this when we create our submission file, and make sure that our column headers match what Kaggle expects.
What format are the column headers in sample_submission.csv?
names(sampsub)
What format are the column headers in test.csv?
names(test)
What format are the column headers in test_predictions?
names(test_predictions)
Converting headers in test_predictions to match sample_submission.csv
test_pred_rename <- test_predictions %>%
rename_at(vars(starts_with(".pred_")), ~str_remove(., ".pred_"))
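Note that rename_at() is superseded in newer versions of dplyr; if you’re on a recent release, an equivalent version using rename_with() would be:
# same result with the non-superseded dplyr verb;
# the "^" anchors the pattern to the start of the column name
test_pred_rename <- test_predictions %>%
  rename_with(~ str_remove(.x, "^\\.pred_"), starts_with(".pred_"))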
Creating our submission file
submission_01 <- test %>%
select(id) %>%
bind_cols(test_pred_rename)
Double-checking our work
glimpse(submission_01)
And done!
write_csv(submission_01, "submission_01.csv")
Resources
If you’d like to learn more about {tidymodels}, I recommend the following:
- Get Started with {tidymodels}, a series of five articles that introduce you to all the major components of building models with the {tidymodels} framework.
- Julia Silge’s blog is an excellent resource. Every week Julia uses the #TidyTuesday dataset to create a {tidymodels} project that she breaks down in both a blog post as well as a screencast.
- Learn with {tidymodels}, a collection of worked examples that cover specific topics in modeling.
- The {tidymodels} package documentation provides links to documentation for each of the packages within the {tidymodels} framework.
- Tidy Modeling with R, an open-source book that is currently being written by the {tidymodels} team.