Dive into {dplyr}

By Jesse Mostipak in technical

May 1, 2020

Note: this post was originally published as a Kaggle notebook.

Introduction: why {dplyr}?

There are a lot of amazing packages in the Tidyverse, but {dplyr} is hands-down my absolute favorite. I use {dplyr} when I’m cleaning and exploring my dataset, and what I particularly love is that after I get a good handle on my dataset with {dplyr}, I can feed the various manipulations I’ve created into the {ggplot2} package for visualization.

This tutorial is for anyone interested in learning the basics of the {dplyr} package. We’ll be focusing on data exploration and manipulation, building on the examples in the {dplyr} package documentation using the Palmer Penguins dataset.

By the end of this notebook, you’ll be able to:

Demonstrate what each of the five main dplyr functions does
Use the pipe operator %>% to chain together multiple dplyr functions

My analytical workflow

We won’t be covering all of the steps in my workflow in this tutorial, but in general I follow these steps:

Set up the programming environment by loading packages
Import my data
Check out my data
Explore my data
Model my data
Communicate what I’ve learned

Set up our environment

# we have a couple of options here - we can load the entire tidyverse or we can just load the 
# tidyverse packages that we're interested in using. I'm going to load the tidyverse, but alternatively you
# could run the following instead:

#library(readr)
#library(dplyr)

# load the tidyverse
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ stringr 1.5.0
## ✔ tidyr   1.3.0     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

A quick note on Conflicts

After running library(tidyverse) you might have noticed that the printout told us which packages were attached successfully (all of them, as evidenced by the green check marks), and where we have conflicts (the red x’s).

Conflicts aren’t necessarily a bad thing! Because R is an open source language and anyone can create a package, it’s common for different packages to use the same name for their functions. In our conflicts we see that the filter() function from the {dplyr} package masks the filter() function from the {stats} package, which means that a plain call to filter() will now run the {dplyr} version. We know which package each function comes from because the package name comes before the double colon and the function name comes after, like this:

package::function()

What if we want to use the filter() function from the {stats} package? All is not lost! What we can do in our code is use the full package::function() syntax and R will know to use the filter() function from the {stats} package instead of the {dplyr} package.
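As a minimal sketch of the package::function() syntax, the snippet below calls both versions of filter() side by side. The vector x and the data frame toy are made-up illustration values, not part of the penguins data.

```r
library(dplyr)

# a small numeric vector (made-up values, just for illustration)
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series;
# here the weights give a centered 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that match a condition
toy <- data.frame(id = 1:5, value = x)
big_values <- dplyr::filter(toy, value > 3)

moving_avg
big_values
```

Writing dplyr::filter() in full is optional once the package is attached, but it can make code easier to read when two packages share a function name.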

Import our data

penguins <- read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')

Parsing the parsing statement

One thing that took me a while to get used to was that just because the text is in BRIGHT RED doesn’t mean that something bad has happened, or that I’ve made a mistake. That’s the case with what we’re seeing here!

What the parsing statement does is tell us which data type read_csv() assigned to each of the columns in our dataframe. The read_csv() function looks at the first thousand rows of a dataset and makes an educated guess about the type of each column. We can override this if we need to, either by telling read_csv() to use more rows to guess using the guess_max argument, or by explicitly telling it what type of data is in each column with the col_types argument.
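Here is a rough sketch of both override options. It writes a tiny throwaway CSV to a temporary file so that it runs anywhere; the file contents and the chosen column types are assumptions for illustration, not the Kaggle file used in this post.

```r
library(readr)

# write a tiny throwaway CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,body_mass_g", "Adelie,3750", "Gentoo,5400"), tmp)

# option 1: let readr guess, but allow it to look at more rows before deciding
guessed <- read_csv(tmp, guess_max = 5000, show_col_types = FALSE)

# option 2: state the column types explicitly instead of guessing
typed <- read_csv(
  tmp,
  col_types = cols(
    species = col_factor(),
    body_mass_g = col_integer()
  )
)

str(guessed)
str(typed)
```

Supplying col_types also silences the parsing statement, since there is nothing left for readr to guess.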

Check out our data

Here’s where I like to get a handle on what I’m working with. I’ll use various functions to make sure my data imported correctly, and start to get an understanding of the data structure and data types. The functions I commonly use to accomplish this are:

glimpse()
head() and tail()
summary()

`glimpse()` is grrrreat!

glimpse() gives you just about everything you could want, all wrapped up in a single function. The printout starts with the structure of our dataframe in rows and columns, telling us that our penguins dataset has 344 rows (or observations) and 8 columns (or variables).

We also see each of the variables listed out by name, followed by the data type <datatype>, and then a look at the first few values of each variable.

glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

head()

Before reading further (or running the code) take a second to think about what the head() function might return.

If you guessed the “head” of the dataframe, or the first few rows, you’d be correct! I use head() to check a couple of things. First, I want to see if my data imported correctly. It’s not uncommon to have the first few rows of a .csv file be blank, or contain information that I don’t want in my final dataset. Second, head() prints out a nicely-formatted table that lets me take a quick look and see if the data is formatted consistently.

Using head() and seeing that your data is formatted consistently isn’t a guarantee that you won’t run into problems later, but it’s a great first check.

head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

summary()

summary() might be one of the first functions I remember using and going “ooooh, this is pretty cool!” Like with the head() function, the name tells you what it does - any data that we pass to summary() will return a set of summary statistics appropriate for that datatype.

We can send individual variables to summary(), or an entire dataframe, and get a quick idea of our data types, the spread of our data, and how much missing data we’ll be dealing with.

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
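To illustrate sending an individual variable to summary(), here is a small sketch that uses only the six body mass values printed by head(penguins) earlier, so the numbers will differ from the full-dataset summary.

```r
# body mass values (in grams) copied from the first six rows printed by head(penguins)
body_mass_g <- c(3750L, 3800L, 3250L, NA, 3450L, 3650L)

# summary() on a single numeric vector returns the minimum, quartiles, mean,
# maximum, and a count of missing values
mass_summary <- summary(body_mass_g)

mass_summary
```

With a dataframe you would write summary(penguins$body_mass_g) to get the same kind of result for one column.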

A note on names()

I have a really hard time remembering what the names of my variables are, and because R is case-sensitive, how the names are formatted. We could fix this by converting all of our variable names to the same case, but for now just know that if you ever need a refresher on the names of the variables in your dataset (and how they’re formatted!) you can run names(), like this:

names(penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
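If you did want to convert all of your variable names to the same case, one possible approach is rename_with() from {dplyr}. The messy data frame below is hypothetical; the penguins column names are already consistently lowercase.

```r
library(dplyr)

# hypothetical data frame with inconsistently cased column names
messy <- tibble(
  Species = c("Adelie", "Gentoo"),
  Body_Mass_G = c(3750, 5400)
)

# rename_with() applies a function to every column name; tolower() is base R
tidy <- messy %>%
  rename_with(tolower)

names(tidy)
```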

Exploring our data with dplyr

Main functions we’ll use

arrange()
filter()
select()
mutate()
summarise() (you can also use summarize())

Reading and writing R code

One thing that I really enjoy about working in R is that I can write out what I want to do in a sentence, and then translate that into code. For example, if I say:

Take the penguins dataset and then filter for all penguins that live on Torgersen island

Take the penguins dataset translates to penguins
and then translates to %>%
filter for all penguins that live on Torgersen island translates to filter(island == "Torgersen")

We can then take these three lines and put them together to get the following:

penguins %>%
  filter(island == "Torgersen")

## # A tibble: 52 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 42 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Wait, what the heck is %>%?

%>% is the pipe operator, and it allows us to push our data through sequential functions in R. Much like we use the words “and then” to describe instructions or steps on how to do something, %>% acts like an “and then” statement between functions.

We can take the code we wrote above and then add a function we’ve already used, head(), to print out a much shorter table, like this:

penguins %>%
  filter(island == "Torgersen") %>%
  head()

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
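One way to see what the pipe is doing is to compare it with the same steps written as nested function calls. This is a small sketch on a toy tibble whose rows are taken from tables printed in this post; the name toy is just for illustration.

```r
library(dplyr)

# a few rows taken from tables printed in this post
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Torgersen", "Biscoe", "Dream"),
  body_mass_g = c(3750, 3800, 4200, 4050)
)

# nested version: read from the inside out
nested <- head(filter(toy, island == "Torgersen"))

# piped version: read from top to bottom, "and then" style
piped <- toy %>%
  filter(island == "Torgersen") %>%
  head()

# both versions run the same steps in the same order
identical(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the two versions are interchangeable.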

So let’s get to it! In this section we’ll go through a couple of examples with each of the individual dplyr functions, and then start combining them to do some powerful data manipulations!

Applying arrange()

arrange() “arranges,” or organizes, our data in ascending order, starting from the lowest value and running to the highest. Character data is sorted alphabetically, and factor data is sorted by the order of its levels (which, for species, happens to be alphabetical).

We can provide a single argument to the arrange() function, such as bill_length_mm (double) or species (a factor in this dataset).

# numeric data 
# I've added the head() function to the end of the function chain to reduce the length of the table that's printed out
# you can remove it in your version!

penguins %>%
  arrange(bill_length_mm) %>%
  head()

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Dream               32.1          15.5               188        3050
## 2 Adelie  Dream               33.1          16.1               178        2900
## 3 Adelie  Torgersen           33.5          19                 190        3600
## 4 Adelie  Dream               34            17.1               185        3400
## 5 Adelie  Torgersen           34.1          18.1               193        3475
## 6 Adelie  Torgersen           34.4          18.4               184        3325
## # ℹ 2 more variables: sex <fct>, year <int>

# character data

penguins %>%
  arrange(species)

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
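Because species is tagged <fct> in the printout, it is worth knowing that arrange() sorts factors by level order rather than alphabetically. The sketch below uses a made-up size column, where the two orders differ, to show the distinction.

```r
library(dplyr)

# made-up factor whose level order is NOT alphabetical
toy <- tibble(
  size = factor(
    c("small", "large", "medium"),
    levels = c("small", "medium", "large")
  )
)

# sorting the factor follows the level order: small, medium, large
by_level <- toy %>%
  arrange(size)

# converting to character first gives alphabetical order: large, medium, small
by_alpha <- toy %>%
  arrange(as.character(size))

by_level
by_alpha
```

For the penguins data the species levels are already in alphabetical order, so both approaches would give the same result.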

Creating a subset

It’s a little hard to see what’s going on in the above table, so I’m going to create a smaller subset of the penguins dataset to make the results easier to follow. You can run the code on the subset of the data, or replace penguins_subset with penguins to see what happens on the full dataset!

# creating a random subset of the penguins dataset
set.seed(406)

penguins_subset <- penguins %>%
  sample_n(12)  # another dplyr function! (newer dplyr versions recommend slice_sample(n = 12))

penguins_subset

## # A tibble: 12 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           41.4          18.5               202        3875
##  2 Gentoo    Biscoe             45.5          13.9               210        4200
##  3 Gentoo    Biscoe             43.5          15.2               213        4650
##  4 Gentoo    Biscoe             50.5          15.9               225        5400
##  5 Gentoo    Biscoe             45.8          14.2               219        4700
##  6 Chinstrap Dream              49.3          19.9               203        4050
##  7 Adelie    Biscoe             40.5          17.9               187        3200
##  8 Chinstrap Dream              45.2          16.6               191        3250
##  9 Adelie    Dream              36.3          19.5               190        3800
## 10 Adelie    Torgers…           39            17.1               191        3050
## 11 Adelie    Biscoe             41.6          18                 192        3950
## 12 Gentoo    Biscoe             48.2          15.6               221        5100
## # ℹ 2 more variables: sex <fct>, year <int>

# let's re-run the arrange() function on character data in the penguins_subset data

penguins_subset %>%
  arrange(species)

## # A tibble: 12 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           41.4          18.5               202        3875
##  2 Adelie    Biscoe             40.5          17.9               187        3200
##  3 Adelie    Dream              36.3          19.5               190        3800
##  4 Adelie    Torgers…           39            17.1               191        3050
##  5 Adelie    Biscoe             41.6          18                 192        3950
##  6 Chinstrap Dream              49.3          19.9               203        4050
##  7 Chinstrap Dream              45.2          16.6               191        3250
##  8 Gentoo    Biscoe             45.5          13.9               210        4200
##  9 Gentoo    Biscoe             43.5          15.2               213        4650
## 10 Gentoo    Biscoe             50.5          15.9               225        5400
## 11 Gentoo    Biscoe             45.8          14.2               219        4700
## 12 Gentoo    Biscoe             48.2          15.6               221        5100
## # ℹ 2 more variables: sex <fct>, year <int>

Nesting desc() inside arrange()

What if we don’t want our data in ascending order? Then we can nest the desc() function, which stands for descending, within the arrange() function. This will then order our numeric data from highest to lowest, and our character data in reverse alphabetical order.

# numeric data arranged in descending order

penguins_subset %>%
  arrange(desc(bill_length_mm))

## # A tibble: 12 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Gentoo    Biscoe             50.5          15.9               225        5400
##  2 Chinstrap Dream              49.3          19.9               203        4050
##  3 Gentoo    Biscoe             48.2          15.6               221        5100
##  4 Gentoo    Biscoe             45.8          14.2               219        4700
##  5 Gentoo    Biscoe             45.5          13.9               210        4200
##  6 Chinstrap Dream              45.2          16.6               191        3250
##  7 Gentoo    Biscoe             43.5          15.2               213        4650
##  8 Adelie    Biscoe             41.6          18                 192        3950
##  9 Adelie    Torgers…           41.4          18.5               202        3875
## 10 Adelie    Biscoe             40.5          17.9               187        3200
## 11 Adelie    Torgers…           39            17.1               191        3050
## 12 Adelie    Dream              36.3          19.5               190        3800
## # ℹ 2 more variables: sex <fct>, year <int>

# character data arranged in descending - reverse alphabetical - order

penguins_subset %>%
  arrange(desc(species))

## # A tibble: 12 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Gentoo    Biscoe             45.5          13.9               210        4200
##  2 Gentoo    Biscoe             43.5          15.2               213        4650
##  3 Gentoo    Biscoe             50.5          15.9               225        5400
##  4 Gentoo    Biscoe             45.8          14.2               219        4700
##  5 Gentoo    Biscoe             48.2          15.6               221        5100
##  6 Chinstrap Dream              49.3          19.9               203        4050
##  7 Chinstrap Dream              45.2          16.6               191        3250
##  8 Adelie    Torgers…           41.4          18.5               202        3875
##  9 Adelie    Biscoe             40.5          17.9               187        3200
## 10 Adelie    Dream              36.3          19.5               190        3800
## 11 Adelie    Torgers…           39            17.1               191        3050
## 12 Adelie    Biscoe             41.6          18                 192        3950
## # ℹ 2 more variables: sex <fct>, year <int>

Fun with filter()

filter() is probably one of my most used functions, because it allows me to look at subsets quickly and easily. What’s nice about filter() is its flexibility - we can use it on a single condition or multiple conditions.

# filter with a single numeric condition

penguins_subset %>%
  filter(bill_depth_mm > 16.2)

## # A tibble: 7 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Torgersen           41.4          18.5               202        3875
## 2 Chinstrap Dream               49.3          19.9               203        4050
## 3 Adelie    Biscoe              40.5          17.9               187        3200
## 4 Chinstrap Dream               45.2          16.6               191        3250
## 5 Adelie    Dream               36.3          19.5               190        3800
## 6 Adelie    Torgersen           39            17.1               191        3050
## 7 Adelie    Biscoe              41.6          18                 192        3950
## # ℹ 2 more variables: sex <fct>, year <int>

# filter with a single character condition

penguins_subset %>%
  filter(island == "Dream")

## # A tibble: 3 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Chinstrap Dream            49.3          19.9               203        4050
## 2 Chinstrap Dream            45.2          16.6               191        3250
## 3 Adelie    Dream            36.3          19.5               190        3800
## # ℹ 2 more variables: sex <fct>, year <int>

# filter with a single numeric condition between two values

penguins_subset %>%
  filter(between(bill_depth_mm, 16.2, 18.1))

## # A tibble: 4 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Biscoe              40.5          17.9               187        3200
## 2 Chinstrap Dream               45.2          16.6               191        3250
## 3 Adelie    Torgersen           39            17.1               191        3050
## 4 Adelie    Biscoe              41.6          18                 192        3950
## # ℹ 2 more variables: sex <fct>, year <int>
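The examples above each use a single condition, so here is a sketch of filtering on multiple conditions. It rebuilds four columns of penguins_subset by hand from the printout (as plain character and numeric columns, which is an assumption made to keep the example self-contained).

```r
library(dplyr)

# four columns of penguins_subset, copied from the printout earlier in this post
subset_rows <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Chinstrap",
              "Adelie", "Chinstrap", "Adelie", "Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream",
             "Biscoe", "Dream", "Dream", "Torgersen", "Biscoe", "Biscoe"),
  bill_depth_mm = c(18.5, 13.9, 15.2, 15.9, 14.2, 19.9,
                    17.9, 16.6, 19.5, 17.1, 18, 15.6),
  body_mass_g = c(3875, 4200, 4650, 5400, 4700, 4050,
                  3200, 3250, 3800, 3050, 3950, 5100)
)

# conditions separated by commas are combined with AND
adelie_deep_bills <- subset_rows %>%
  filter(species == "Adelie", bill_depth_mm > 18)

# %in% keeps rows that match ANY value in a set (an OR across values)
dream_or_torgersen <- subset_rows %>%
  filter(island %in% c("Dream", "Torgersen"))

adelie_deep_bills
dream_or_torgersen
```

You can also write OR conditions explicitly with the | operator, for example filter(island == "Dream" | island == "Torgersen").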

Starting with select()

select() allows us to pick which columns (variables) we want to look at, and we can use it to pull a subset of variables, or even rearrange the order of our variables within a dataframe.

# selecting species, flipper_length_mm, and sex columns

penguins_subset %>%
  select(species, flipper_length_mm, sex)

## # A tibble: 12 × 3
##    species   flipper_length_mm sex   
##    <fct>                 <int> <fct> 
##  1 Adelie                  202 male  
##  2 Gentoo                  210 female
##  3 Gentoo                  213 female
##  4 Gentoo                  225 male  
##  5 Gentoo                  219 female
##  6 Chinstrap               203 male  
##  7 Adelie                  187 female
##  8 Chinstrap               191 female
##  9 Adelie                  190 male  
## 10 Adelie                  191 female
## 11 Adelie                  192 male  
## 12 Gentoo                  221 male

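# selecting all character data
# (this returns zero columns here, because species, island, and sex are stored as factors, not characters)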

penguins_subset %>%
  select(where(is.character))

## # A tibble: 12 × 0
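The empty 12 × 0 result above is what where(is.character) returns when no column is of type character; in the printouts, the categorical columns are tagged <fct>. A minimal sketch of the difference, using a toy tibble with a hypothetical nickname column added for contrast:

```r
library(dplyr)

toy <- tibble(
  species = factor(c("Adelie", "Gentoo")),   # stored as a factor
  nickname = c("penguin_a", "penguin_b"),    # hypothetical character column
  body_mass_g = c(3875L, 4200L)
)

# picks up only the factor column
factor_cols <- toy %>%
  select(where(is.factor))

# picks up only the character column
character_cols <- toy %>%
  select(where(is.character))

names(factor_cols)
names(character_cols)
```

On a dataset like the one printed in this post, select(where(is.factor)) would be the selection that returns species, island, and sex.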

# selecting all numeric data

penguins_subset %>%
  select(where(is.numeric))

## # A tibble: 12 × 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           41.4          18.5               202        3875  2009
##  2           45.5          13.9               210        4200  2008
##  3           43.5          15.2               213        4650  2009
##  4           50.5          15.9               225        5400  2008
##  5           45.8          14.2               219        4700  2008
##  6           49.3          19.9               203        4050  2009
##  7           40.5          17.9               187        3200  2007
##  8           45.2          16.6               191        3250  2009
##  9           36.3          19.5               190        3800  2008
## 10           39            17.1               191        3050  2009
## 11           41.6          18                 192        3950  2008
## 12           48.2          15.6               221        5100  2008

# selecting all non-numeric columns (here, the three factor columns) by using "where not numeric"

penguins_subset %>%
  select(!where(is.numeric))

## # A tibble: 12 × 3
##    species   island    sex   
##    <fct>     <fct>     <fct> 
##  1 Adelie    Torgersen male  
##  2 Gentoo    Biscoe    female
##  3 Gentoo    Biscoe    female
##  4 Gentoo    Biscoe    male  
##  5 Gentoo    Biscoe    female
##  6 Chinstrap Dream     male  
##  7 Adelie    Biscoe    female
##  8 Chinstrap Dream     female
##  9 Adelie    Dream     male  
## 10 Adelie    Torgersen female
## 11 Adelie    Biscoe    male  
## 12 Gentoo    Biscoe    male

Math with mutate()

What’s not to love about a function that lets us create new columns (variables)?! For these examples we’ll work strictly with mutate(), but when you work on extending this notebook, try using group_by() and then mutate()! (We’ll talk about group_by() in the next section.)

# converting grams to pounds
# notice how the order of our columns stays the same, and the new column, body_weight_pounds, gets placed at the 
# far right of the dataframe. what function could we use to change this order?

penguins_subset %>%
  mutate(body_weight_pounds = body_mass_g / 453.59237)

## # A tibble: 12 × 9
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           41.4          18.5               202        3875
##  2 Gentoo    Biscoe             45.5          13.9               210        4200
##  3 Gentoo    Biscoe             43.5          15.2               213        4650
##  4 Gentoo    Biscoe             50.5          15.9               225        5400
##  5 Gentoo    Biscoe             45.8          14.2               219        4700
##  6 Chinstrap Dream              49.3          19.9               203        4050
##  7 Adelie    Biscoe             40.5          17.9               187        3200
##  8 Chinstrap Dream              45.2          16.6               191        3250
##  9 Adelie    Dream              36.3          19.5               190        3800
## 10 Adelie    Torgers…           39            17.1               191        3050
## 11 Adelie    Biscoe             41.6          18                 192        3950
## 12 Gentoo    Biscoe             48.2          15.6               221        5100
## # ℹ 3 more variables: sex <fct>, year <int>, body_weight_pounds <dbl>

# OK I wanted to show you how to combine select and mutate
# what do you think the everything() function might do? confirm your guess by looking at the documentation (linked at 
# the end of the notebook).

penguins_subset %>%
  mutate(body_weight_pounds = body_mass_g / 453.59237) %>%
  select(species, body_mass_g, body_weight_pounds, everything())

## # A tibble: 12 × 9
##    species   body_mass_g body_weight_pounds island  bill_length_mm bill_depth_mm
##    <fct>           <int>              <dbl> <fct>            <dbl>         <dbl>
##  1 Adelie           3875               8.54 Torger…           41.4          18.5
##  2 Gentoo           4200               9.26 Biscoe            45.5          13.9
##  3 Gentoo           4650              10.3  Biscoe            43.5          15.2
##  4 Gentoo           5400              11.9  Biscoe            50.5          15.9
##  5 Gentoo           4700              10.4  Biscoe            45.8          14.2
##  6 Chinstrap        4050               8.93 Dream             49.3          19.9
##  7 Adelie           3200               7.05 Biscoe            40.5          17.9
##  8 Chinstrap        3250               7.17 Dream             45.2          16.6
##  9 Adelie           3800               8.38 Dream             36.3          19.5
## 10 Adelie           3050               6.72 Torger…           39            17.1
## 11 Adelie           3950               8.71 Biscoe            41.6          18  
## 12 Gentoo           5100              11.2  Biscoe            48.2          15.6
## # ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>
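If you want a starting point for the group_by() and then mutate() combination mentioned in this section, here is one possible sketch. It uses the species and body_mass_g columns of penguins_subset, typed in by hand from the printout; the new column names are made up for illustration.

```r
library(dplyr)

# two columns of penguins_subset, copied from the printout in this post
subset_rows <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Chinstrap",
              "Adelie", "Chinstrap", "Adelie", "Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3875, 4200, 4650, 5400, 4700, 4050,
                  3200, 3250, 3800, 3050, 3950, 5100)
)

# with group_by() in front, mutate() computes the mean within each species
# and keeps every row, unlike summarise()
with_species_avg <- subset_rows %>%
  group_by(species) %>%
  mutate(
    species_avg_mass = mean(body_mass_g),
    diff_from_avg = body_mass_g - species_avg_mass
  ) %>%
  ungroup()

with_species_avg
```

Calling ungroup() at the end is a common habit so that later steps are not accidentally computed per group.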

Summaries with summarise(), with help from group_by()

You can use either summarise() or summarize() to get a summary, or overview, of your data. What’s more, once we introduce group_by() you can get summary data for subsets of your data.

# summarising the average body mass of penguins, in grams

penguins_subset %>%
  summarise(avg_body_mass = mean(body_mass_g))

## # A tibble: 1 × 1
##   avg_body_mass
##           <dbl>
## 1         4102.

# since we're now summarising our data we can go ahead and use the full dataframe, since the printout will be reasonably-sized

penguins %>%
  summarise(avg_body_mass = mean(body_mass_g))

## # A tibble: 1 × 1
##   avg_body_mass
##           <dbl>
## 1            NA

The NAs!

If we don’t handle our NA values we’re going to be in for a bad time. There are multiple ways of dealing with NA values in R - for now we’re going to use na.rm = TRUE, but you could use filter() from the {dplyr} package or drop_na() from the {tidyr} package as well!

na.rm is like asking the question, “Should we remove NAs from our calculation?” where na stands for NA values, and rm stands for remove. So when we set na.rm = TRUE we’re saying “Yes, please remove NA values from my calculations.” Likewise if we use na.rm = FALSE, which is the default for mean(), we’re telling R that we want to include NA values in our calculations, and a single NA is enough to make the result NA.

And if you’re not sure, NA stands for “Not Available,” meaning data that is missing.

# summarising body mass on the entire penguins dataset while removing NA values from the calculation

penguins %>%
  summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))

## # A tibble: 1 × 1
##   avg_body_mass
##           <dbl>
## 1         4202.
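For comparison, here is a sketch of the two alternatives mentioned earlier, filter() and drop_na(). It uses the six body mass values from the head(penguins) printout, one of which is NA, so the average differs from the full-dataset value.

```r
library(dplyr)
library(tidyr)

# body mass values copied from the first six rows printed by head(penguins)
toy <- tibble(
  species = rep("Adelie", 6),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# option 1: keep only the rows where body_mass_g is not NA
via_filter <- toy %>%
  filter(!is.na(body_mass_g)) %>%
  summarise(avg_body_mass = mean(body_mass_g))

# option 2: drop rows that have an NA in the named column
via_drop_na <- toy %>%
  drop_na(body_mass_g) %>%
  summarise(avg_body_mass = mean(body_mass_g))

via_filter
via_drop_na
```

Both options remove whole rows, while na.rm = TRUE only skips the missing values inside one calculation, so which one fits best depends on what you do next with the data.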

# now let's use the grouping function, group_by(), to look at the average body mass of penguins, in grams,
# by species

penguins %>%
  group_by(species) %>%
  summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE))

## # A tibble: 3 × 2
##   species   avg_species_body_mass
##   <fct>                     <dbl>
## 1 Adelie                    3701.
## 2 Chinstrap                 3733.
## 3 Gentoo                    5076.

# now let's calculate the average body mass by species AND island

penguins %>%
  group_by(species, island) %>%
  summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE))

## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.

## # A tibble: 5 × 3
## # Groups:   species [3]
##   species   island    avg_species_body_mass
##   <fct>     <fct>                     <dbl>
## 1 Adelie    Biscoe                    3710.
## 2 Adelie    Dream                     3688.
## 3 Adelie    Torgersen                 3706.
## 4 Chinstrap Dream                     3733.
## 5 Gentoo    Biscoe                    5076.
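The message about grouped output refers to the .groups argument of summarise(). As a sketch, setting .groups = "drop" returns an ungrouped result and silences the message. The data below is three columns of penguins_subset typed in from the printout, so the averages are for those 12 penguins only.

```r
library(dplyr)

# three columns of penguins_subset, copied from the printout in this post
subset_rows <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Chinstrap",
              "Adelie", "Chinstrap", "Adelie", "Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream",
             "Biscoe", "Dream", "Dream", "Torgersen", "Biscoe", "Biscoe"),
  body_mass_g = c(3875, 4200, 4650, 5400, 4700, 4050,
                  3200, 3250, 3800, 3050, 3950, 5100)
)

# .groups = "drop" removes all grouping from the result
by_species_island <- subset_rows %>%
  group_by(species, island) %>%
  summarise(avg_body_mass = mean(body_mass_g), .groups = "drop")

by_species_island
```

Without .groups, the result stays grouped by species, which matters if you chain more functions after summarise().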

Where to next?

What we’ve done here only scratches the surface of what you can accomplish with {dplyr}. {dplyr} is a powerful package in its own right, but even more so once you dive into column-wise operations, like across(), as well as combine it with other packages in the Tidyverse, such as {purrr} and {ggplot2}.
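As a small taste of column-wise operations, the sketch below uses across() to take the mean of every numeric column at once. The toy tibble holds the bill measurements from the first four rows printed by head(penguins); the name col_means is just for illustration.

```r
library(dplyr)

# bill measurements copied from the first four rows printed by head(penguins)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Adelie"),
  bill_length_mm = c(39.1, 39.5, 40.3, NA),
  bill_depth_mm = c(18.7, 17.4, 18.0, NA)
)

# across() applies the same function to each selected column;
# where(is.numeric) selects every numeric column
col_means <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

col_means
```

The same pattern works inside mutate(), and it combines with group_by() in the same way as the single-column summaries shown earlier.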

What I recommend is making a copy of this notebook and running the cells to ensure you understand what’s happening with each function, and then building out additional chains of {dplyr} functions to see what you can discover and learn! Play around and don’t worry about making mistakes - it’s all part of learning!

These are some helpful resources to get you started:

{dplyr} documentation - so many functions!
R for Data Science text
STAT545
More on column-wise operations
