{tidymodels} and XGBoost

By Jesse Mostipak in technical

May 17, 2021

Note: this post was originally published as a Kaggle notebook

Introduction

This notebook is designed to provide an introductory-level approach to building an xgboost model using {tidymodels}. As you go through the notebook you’ll notice that there are many ways to improve upon this approach, including, but not limited to, the following (a quick sketch of the first two appears just after this list):

  • splitting the data into training and validation sets
  • evaluating the model with resampling
  • tuning the model parameters
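Here’s a minimal, hedged sketch of what the first two of those improvements could look like using {rsample} (the object names, seed, and fold count below are my own illustrative choices, and this code isn’t used elsewhere in the notebook):

library(tidymodels)

set.seed(27)

# hold out a validation set from the training data
train_split <- initial_split(train, strata = target)
train_set   <- training(train_split)
valid_set   <- testing(train_split)

# create resamples of the remaining training data for model evaluation
train_folds <- vfold_cv(train_set, v = 5, strata = target)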

In its current state the model takes ~20 minutes to train. Please feel free to use this code as a baseline as you build out a more robust model for the Kaggle June 2021 Tabular Playground Series, and don’t hesitate to tag me in what you create – I’d love to see what you make!

Setup

Installing the {usemodels} package

I love the {usemodels} package! As you’ll see a bit further in the notebook, we can use the {usemodels} package to generate all of the boilerplate code that we need to build our model.

install.packages("usemodels")

Loading all of the packages

library(tidymodels)  # metapackage of all the tidymodels packages
library(usemodels)   # package that creates all of the boilerplate code for our model <3
library(tidyverse)   # metapackage of all tidyverse packages - likely overkill for this project but that's OK

tidymodels_prefer()  # "tells" R to use the tidymodels version of a function if there is a conflict

list.files(path = "../input")

Importing the data

train <- read_csv("../input/tabular-playground-series-jun-2021/train.csv")
test <- read_csv("../input/tabular-playground-series-jun-2021/test.csv")
sampsub <- read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")

Inspecting the data

While I did not include any data visualizations in this notebook, I did want to include two of the functions I often use when looking at a new-to-me dataset.

  • dim() will return the dimensions of our dataset; in this case we’re looking at 200,000 rows and 77 columns.
  • glimpse() will also return the dimensions of our dataset, along with each column name, the data type of each column, and the values in the first few rows. What we see with this particular dataset is that our id and feature_* columns are all numeric (doubles), while our target column is made up of character data.
dim(train)
glimpse(train)

Wrangling the data

We’re not doing anything too fancy with our data at this point, but I did set up a quick mutate() call to convert our target column from character data to factor data.

The code will actually convert all of the columns that contain character data into factor data. In this case it’s only a single column, and we could have accomplished the same thing with mutate(target = as.factor(target)), but I wanted to provide this example in case you ever need it in the future!

train_tidy <- train %>% 
  mutate(across(where(is.character), as.factor))
glimpse(train_tidy)

Building the xgboost model

We haven’t done any feature engineering, nor have we explored our data beyond using the dim() and glimpse() functions. For this first pass at modeling we’re simply going to throw all of our features into the model and see how well they predict the target using an xgboost model.

To use the {usemodels} package, we pull the function associated with the model we want to train, in this case xgboost. We also provide the data argument to the function, and when we run the code we see that we get our recipe, spec, workflow, and tune code. We can then copy and paste what we need and alter it.

Getting boilerplate code for xgboost

The {usemodels} package will provide boilerplate code for a host of models that can be used within the {tidymodels} framework. You can read more about the package and how it works in the {usemodels} documentation.

# we're essentially saying: 
#   return the code for an xgboost model that predicts the target 
#   based on all of the features within the dataset, 
#   using the train_tidy tibble as our data source

use_xgboost(target ~ ., data = train_tidy)
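The same pattern works for the package’s other helpers. For example, these calls (shown only for illustration; their output isn’t used anywhere below) would generate boilerplate for glmnet and ranger models:

# other usemodels helpers follow the same formula + data pattern
use_glmnet(target ~ ., data = train_tidy)
use_ranger(target ~ ., data = train_tidy)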

Setting up the code for xgboost

There are a couple of things in the code that look different from the output of our previous line of code, use_xgboost(target ~ ., data = train_tidy).

You’ll notice that in the recipe I’ve added in the line update_role(id, new_role = "id"). This is because I want the id column to be “carried along” so that I can access it later, but I do not want it to be included as a feature when I train my model.

In the spec you’ll see that I’ve removed almost all of the tuning variables, and set trees = 100. This is a first pass at the model, and at this point I’m not too worried about tuning the model, but I do encourage you to dig into the resources provided at the end of this notebook and try your hand at tuning!

Last but not least, I did not include the set.seed() function or the code used to tune the model.

xgboost_recipe <- 
  recipe(formula = target ~ ., data = train_tidy) %>% 
  # step_zv removes variables that contain only a single value
  step_zv(all_predictors()) %>% 
  update_role(id, new_role = "id")

xgboost_spec <- 
  boost_tree(trees = 100) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec)
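As mentioned above, I removed the tuning code, but here’s a hedged sketch of what a minimal tuning setup could look like, reusing the recipe from above (the seed, fold count, grid size, and choice of metric are all my own illustrative picks, not values from the original notebook):

# mark the parameters we want to tune instead of fixing them
xgboost_tune_spec <- 
  boost_tree(trees = tune(), learn_rate = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

xgboost_tune_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_tune_spec)

# resample the training data and evaluate 10 candidate parameter sets
set.seed(27)
folds <- vfold_cv(train_tidy, v = 5, strata = target)

xgboost_tune_res <- 
  tune_grid(xgboost_tune_workflow, resamples = folds, grid = 10)

# pick the best candidate by area under the ROC curve
select_best(xgboost_tune_res, metric = "roc_auc")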

Training the xgboost model

xgb_fit <- xgboost_workflow %>% 
  fit(train_tidy)

Making predictions with the xgboost model

It’s important to note here that we’ve added type = "prob" to our predict() call. If we don’t provide this, parsnip falls back to the default prediction type for a classification model, and you’ll end up with a single .pred_class column listing the predicted Class for each row. Instead, what we want to know is the probability that a given id belongs to each Class.

You can read more about the type argument in the parsnip documentation for predict().
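Here’s a minimal version of that call, assuming the fitted workflow (xgb_fit) and the raw test tibble from earlier:

# class probabilities: one .pred_* column per level of target
test_predictions <- predict(xgb_fit, new_data = test, type = "prob")
test_predictions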

Formatting the competition submission

You may have noticed that when we printed out test_predictions in the previous code chunk, the column headers were formatted as follows: .pred_Class_*. We’ll want to make sure we address this when we create our submission file, and make sure that our column headers match what Kaggle expects.

What format are the column headers in sample_submission.csv?
names(sampsub)

What format are the column headers in test.csv?
names(test)

What format are the column headers in test_predictions?
names(test_predictions)

Converting headers in test_predictions to match sample_submission.csv

# strip the ".pred_" prefix so the headers match sample_submission.csv
test_pred_rename <- test_predictions %>% 
  rename_at(vars(starts_with(".pred_")), ~str_remove(., ".pred_"))

Creating our submission file

submission_01 <- test %>% 
  select(id) %>% 
  bind_cols(test_pred_rename)

Double-checking our work

glimpse(submission_01)

And done!

write_csv(submission_01, "submission_01.csv")

Resources

If you’d like to learn more about {tidymodels}, I recommend the following:

  • Get Started with {tidymodels}, a series of five articles that introduce you to all the major components of building models with the {tidymodels} framework.
  • Julia Silge’s blog is an excellent resource. Every week Julia uses the #TidyTuesday dataset to create a {tidymodels} project that she breaks down in both a blog post as well as a screencast.
  • Learn with {tidymodels}, a collection of worked examples that cover specific topics in modeling.
  • {tidymodels} package documentation provides links to documentation for each of the packages within the {tidymodels} framework.
  • Tidy Modeling with R, an open-source book that is currently being written by the {tidymodels} team.