Dive into {dplyr}
By Jesse Mostipak in technical
May 1, 2020
Note: this post was originally published as a Kaggle notebook
Introduction: why {dplyr}?
There are a lot of amazing packages in the Tidyverse, but {dplyr} is hands-down my absolute favorite package. I use {dplyr} when I’m cleaning and exploring my dataset, and what I particularly love is that after I get a good handle on my dataset with {dplyr}, I can feed the various manipulations I’ve created into the {ggplot2} package for visualization.
This tutorial is for anyone interested in learning the basics of the {dplyr}
package. We’ll be focusing on data exploration and manipulation, building off of the examples in the {dplyr}
package documentation using the
Palmer Penguins dataset.
By the end of this notebook, you’ll be able to:
- Demonstrate what each of the five main {dplyr} functions does
- Use the pipe operator %>% to chain together multiple {dplyr} functions
My analytical workflow
We won’t be covering all of the steps in my workflow in this tutorial, but in general I follow these steps:
- Set up the programming environment by loading packages
- Import my data
- Check out my data
- Explore my data
- Model my data
- Communicate what I’ve learned
Set up our environment
# we have a couple of options here - we can load the entire tidyverse or we can just load the
# tidyverse packages that we're interested in using. I'm going to load the tidyverse, but alternatively you
# could run the following instead:
#library(readr)
#library(dplyr)
# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ stringr 1.5.0
## ✔ tidyr 1.3.0 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
A quick note on Conflicts
After running library(tidyverse) you might have noticed that the printout told us which packages were attached successfully (all of them, as evidenced by the green check marks), and where we have conflicts (the red x’s).
Conflicts aren’t necessarily a bad thing! Because R is an open source language and anyone can create a package, it’s common for different packages to use the same name for similar functions. In our conflicts we see that the filter()
function from the {dplyr}
package masks the filter()
function from the {stats}
package. We know this because the package name comes before the double colon and the function name comes after, like this:
package::function()
What if we want to use the filter()
function from the {stats}
package? All is not lost! What we can do in our code is use the full package::function()
syntax and R will know to use the filter()
function from the {stats}
package instead of the {dplyr}
package.
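To make that concrete, here’s a minimal sketch of calling both versions of filter() by their full names. The vector x and the data frame df are made up for this illustration and aren’t part of the penguins data.

```r
library(dplyr)

# A small numeric vector (made up for illustration)
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series;
# here the weights give a 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that match a condition
df <- data.frame(x = x)
big_rows <- dplyr::filter(df, x > 3)

moving_avg
big_rows
```

Writing the package name out in full works whether or not the package has been attached with library(), which is why it’s a handy way to sidestep a conflict.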
Import our data
penguins <- read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')
Parsing the parsing statement
One thing that took me a while to get used to was that just because the text is in BRIGHT RED doesn’t mean that something bad has happened, or that I’ve made a mistake. And that’s the same as what we’re seeing here!
What the parsing statement does is tell us what data type R assigned to each of the columns in our dataframe. By default, the read_csv() function looks at the first thousand rows of a dataset and makes an educated guess as to the type of each column. We can override this if we need to, either by telling R to use more rows to guess using the guess_max argument, or by explicitly telling R what type of data is in each column.
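Here’s a rough sketch of both overrides. To keep it self-contained I’m reading a tiny inline CSV (csv_text, made up for this example) instead of the real file, which assumes readr 2.0 or later, where wrapping a string in I() marks it as literal data. The column types I chose are just one reasonable option.

```r
library(readr)

# Tiny inline CSV standing in for a real file (values are illustrative)
csv_text <- "species,body_mass_g\nAdelie,3750\nGentoo,5400\n"

# Option 1: let readr guess, but allow it to look at more rows first
guessed <- read_csv(I(csv_text), guess_max = 5000)

# Option 2: state the column types explicitly
typed <- read_csv(
  I(csv_text),
  col_types = cols(
    species = col_factor(),
    body_mass_g = col_integer()
  )
)

# Compare how each approach stored the columns
sapply(guessed, class)
sapply(typed, class)
```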
Check out our data
Here’s where I like to get a handle on what I’m working with. I’ll use various functions to make sure my data imported correctly, and start to get an understanding of the data structure and data types. The functions I commonly use to accomplish this are:
- glimpse()
- head() and tail()
- summary()
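We won’t run tail() on the penguins data in this post, so here’s a quick sketch of it on a hand-built tibble. The name penguins_mini is hypothetical, and its values are copied from a few rows that show up later in this tutorial.

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name)
penguins_mini <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Chinstrap"),
  body_mass_g = c(3875L, 4200L, 4650L, 5400L, 4700L, 4050L)
)

# tail() is the mirror image of head(): it returns the last rows,
# and n controls how many
last_two <- tail(penguins_mini, n = 2)
last_two
```

I like tail() for catching stray totals or footnotes that sometimes sit at the bottom of a .csv file.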
glimpse() is grrrreat!
glimpse() gives you just about everything you could want, all wrapped up in a single function. The printout starts with the dimensions of our dataframe, telling us that in our penguins dataset we have 344 rows (or observations) and 8 columns (or variables).
We also see each of the variables listed out by name, followed by the data type <datatype>, and then a look at the first few values of each variable.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
head()
Before reading further (or running the code) take a second to think about what the head()
function might return.
If you guessed the “head” of the dataframe, or the first few rows, you’d be correct! I use head()
to check a couple of things. First, I want to see if my data imported correctly. It’s not uncommon to have the first few rows of a .csv
file be blank, or contain information that I don’t want in my final dataset. Second, head()
prints out a nicely-formatted table that lets me take a quick look and see if the data is formatted consistently.
Using head()
and seeing that your data is formatted consistently isn’t a guarantee that you won’t run into problems later, but it’s a great first check.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
summary()
summary()
might be one of the first functions I remember using and going “ooooh, this is pretty cool!” Like with the head()
function, the name tells you what it does - any data that we pass to summary()
will return a set of summary statistics appropriate for that datatype.
We can send individual variables to summary(), or an entire dataframe, and get a quick idea of our data types, the spread of our data, and how much missing data we’ll be dealing with.
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
A note on names()
I have a really hard time remembering what the names of my variables are, and because R is case-sensitive, how the names are formatted. We could fix this by converting all of our variable names to the same case, but for now just know that if you ever need a refresher on the names of the variables in your dataset (and how they’re formatted!) you can run names()
, like this:
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
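The penguins names are already tidy, but if you ever do need to convert names to a single case, one option is rename_with() from {dplyr}. This is only a sketch: the messy tibble and its column names are invented for the example.

```r
library(dplyr)

# A made-up tibble with inconsistently formatted column names
messy <- tibble(
  Species = "Adelie",
  `Body Mass G` = 3750L,
  ISLAND = "Torgersen"
)

# rename_with() applies a function to the column names;
# here we swap spaces for underscores and lower-case everything
tidy_names <- messy %>%
  rename_with(~ tolower(gsub(" ", "_", .x)))

names(tidy_names)
```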
Exploring our data with dplyr
Main functions we’ll use
- arrange()
- filter()
- select()
- mutate()
- summarise() (you can also use summarize())
Reading and writing R code
One thing that I really enjoy about working in R is that I can write out what I want to do in a sentence, and then translate that into code. For example, if I say:
Take the penguins dataset and then filter for all penguins that live on Torgersen island
- “Take the penguins dataset” translates to penguins
- “and then” translates to %>%
- “filter for all penguins that live on Torgersen island” translates to filter(island == "Torgersen")
We can then take these three lines and put them together to get the following:
penguins %>%
filter(island == "Torgersen")
## # A tibble: 52 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 42 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Wait what the heck is %>%?
%>%
is the pipe operator, and it allows us to push our data through sequential functions in R. Much like we use the words “and then” to describe instructions or steps on how to do something, %>%
acts like an “and then” statement between functions.
We can take the code we wrote above and then add a function we’ve already used, head(), to print out a much shorter table, like this:
penguins %>%
filter(island == "Torgersen") %>%
head()
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
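If it helps, here’s a small sketch showing that a piped chain is just another way of writing nested function calls. I’m using a hand-built tibble, penguins_mini (a hypothetical name with illustrative values), so the snippet runs on its own.

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name)
penguins_mini <- tibble(
  island = c("Torgersen", "Biscoe", "Torgersen", "Dream"),
  body_mass_g = c(3875L, 4200L, 3050L, 4050L)
)

# Piped version: read it top to bottom as "and then"
piped <- penguins_mini %>%
  filter(island == "Torgersen") %>%
  head(1)

# Nested version: the same steps, read from the inside out
nested <- head(filter(penguins_mini, island == "Torgersen"), 1)

# Both approaches give the same result
all.equal(piped, nested)
```

The pipe passes whatever is on its left in as the first argument of the function on its right, which is why the two versions match.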
So let’s get to it! In this section we’ll go through a couple of examples with each of the individual dplyr functions, and then start combining them to do some powerful data manipulations!
Applying arrange()
arrange() “arranges,” or organizes, our data in ascending order, starting from the lowest value and running to the highest (or in the case of character data, in alphabetical order).
We can provide a single argument to the arrange() function, such as bill_length_mm (double) or species (stored here as a factor, which sorts by the order of its levels; in this dataset the levels are alphabetical).
# numeric data
# I've added the head() function to the end of the function chain to reduce the length of the table that's printed out
# you can remove it in your version!
penguins %>%
arrange(bill_length_mm) %>%
head()
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 32.1 15.5 188 3050
## 2 Adelie Dream 33.1 16.1 178 2900
## 3 Adelie Torgersen 33.5 19 190 3600
## 4 Adelie Dream 34 17.1 185 3400
## 5 Adelie Torgersen 34.1 18.1 193 3475
## 6 Adelie Torgersen 34.4 18.4 184 3325
## # ℹ 2 more variables: sex <fct>, year <int>
# character data
penguins %>%
arrange(species)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Creating a subset
It’s a little hard to see what’s going on in the above table, so I’m going to create a smaller subset of the penguins dataset to make the results easier to read. You can run the code on the subset of the data, or replace penguins_subset with penguins to see what happens on the full dataset!
# creating a random subset of the penguins dataset
set.seed(406)
penguins_subset <- penguins %>%
sample_n(12) # another dplyr function!
penguins_subset
## # A tibble: 12 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 41.4 18.5 202 3875
## 2 Gentoo Biscoe 45.5 13.9 210 4200
## 3 Gentoo Biscoe 43.5 15.2 213 4650
## 4 Gentoo Biscoe 50.5 15.9 225 5400
## 5 Gentoo Biscoe 45.8 14.2 219 4700
## 6 Chinstrap Dream 49.3 19.9 203 4050
## 7 Adelie Biscoe 40.5 17.9 187 3200
## 8 Chinstrap Dream 45.2 16.6 191 3250
## 9 Adelie Dream 36.3 19.5 190 3800
## 10 Adelie Torgers… 39 17.1 191 3050
## 11 Adelie Biscoe 41.6 18 192 3950
## 12 Gentoo Biscoe 48.2 15.6 221 5100
## # ℹ 2 more variables: sex <fct>, year <int>
# let's re-run the arrange() function on character data in the penguins_subset data
penguins_subset %>%
arrange(species)
## # A tibble: 12 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 41.4 18.5 202 3875
## 2 Adelie Biscoe 40.5 17.9 187 3200
## 3 Adelie Dream 36.3 19.5 190 3800
## 4 Adelie Torgers… 39 17.1 191 3050
## 5 Adelie Biscoe 41.6 18 192 3950
## 6 Chinstrap Dream 49.3 19.9 203 4050
## 7 Chinstrap Dream 45.2 16.6 191 3250
## 8 Gentoo Biscoe 45.5 13.9 210 4200
## 9 Gentoo Biscoe 43.5 15.2 213 4650
## 10 Gentoo Biscoe 50.5 15.9 225 5400
## 11 Gentoo Biscoe 45.8 14.2 219 4700
## 12 Gentoo Biscoe 48.2 15.6 221 5100
## # ℹ 2 more variables: sex <fct>, year <int>
Nesting desc() inside arrange()
What if we don’t want our data in ascending order? Then we can nest the desc()
function, which stands for descending, within the arrange()
function. This will then order our numeric data from highest to lowest, and our character data in reverse alphabetical order.
# numeric data arranged in descending order
penguins_subset %>%
arrange(desc(bill_length_mm))
## # A tibble: 12 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 50.5 15.9 225 5400
## 2 Chinstrap Dream 49.3 19.9 203 4050
## 3 Gentoo Biscoe 48.2 15.6 221 5100
## 4 Gentoo Biscoe 45.8 14.2 219 4700
## 5 Gentoo Biscoe 45.5 13.9 210 4200
## 6 Chinstrap Dream 45.2 16.6 191 3250
## 7 Gentoo Biscoe 43.5 15.2 213 4650
## 8 Adelie Biscoe 41.6 18 192 3950
## 9 Adelie Torgers… 41.4 18.5 202 3875
## 10 Adelie Biscoe 40.5 17.9 187 3200
## 11 Adelie Torgers… 39 17.1 191 3050
## 12 Adelie Dream 36.3 19.5 190 3800
## # ℹ 2 more variables: sex <fct>, year <int>
# character data arranged in descending - reverse alphabetical - order
penguins_subset %>%
arrange(desc(species))
## # A tibble: 12 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 45.5 13.9 210 4200
## 2 Gentoo Biscoe 43.5 15.2 213 4650
## 3 Gentoo Biscoe 50.5 15.9 225 5400
## 4 Gentoo Biscoe 45.8 14.2 219 4700
## 5 Gentoo Biscoe 48.2 15.6 221 5100
## 6 Chinstrap Dream 49.3 19.9 203 4050
## 7 Chinstrap Dream 45.2 16.6 191 3250
## 8 Adelie Torgers… 41.4 18.5 202 3875
## 9 Adelie Biscoe 40.5 17.9 187 3200
## 10 Adelie Dream 36.3 19.5 190 3800
## 11 Adelie Torgers… 39 17.1 191 3050
## 12 Adelie Biscoe 41.6 18 192 3950
## # ℹ 2 more variables: sex <fct>, year <int>
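arrange() also accepts more than one column, with later columns breaking ties in the earlier ones. Here’s a minimal sketch on a hand-built tibble (penguins_mini is a hypothetical name; the values are borrowed from the subset above).

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name)
penguins_mini <- tibble(
  species = c("Gentoo", "Adelie", "Gentoo", "Adelie"),
  bill_length_mm = c(45.5, 41.4, 50.5, 36.3)
)

# Sort by species (A to Z), then by bill length from longest to shortest
# within each species
sorted <- penguins_mini %>%
  arrange(species, desc(bill_length_mm))

sorted
```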
Fun with filter()
filter()
is probably one of my most used functions, because it allows me to look at subsets quickly and easily. What’s nice about filter()
is its flexibility - we can use it on a single condition or multiple conditions.
# filter with a single numeric condition
penguins_subset %>%
filter(bill_depth_mm > 16.2)
## # A tibble: 7 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 41.4 18.5 202 3875
## 2 Chinstrap Dream 49.3 19.9 203 4050
## 3 Adelie Biscoe 40.5 17.9 187 3200
## 4 Chinstrap Dream 45.2 16.6 191 3250
## 5 Adelie Dream 36.3 19.5 190 3800
## 6 Adelie Torgersen 39 17.1 191 3050
## 7 Adelie Biscoe 41.6 18 192 3950
## # ℹ 2 more variables: sex <fct>, year <int>
# filter with a single character condition
penguins_subset %>%
filter(island == "Dream")
## # A tibble: 3 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 49.3 19.9 203 4050
## 2 Chinstrap Dream 45.2 16.6 191 3250
## 3 Adelie Dream 36.3 19.5 190 3800
## # ℹ 2 more variables: sex <fct>, year <int>
# filter for values that fall between two numbers (between() includes both endpoints)
penguins_subset %>%
filter(between(bill_depth_mm, 16.2, 18.1))
## # A tibble: 4 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 40.5 17.9 187 3200
## 2 Chinstrap Dream 45.2 16.6 191 3250
## 3 Adelie Torgersen 39 17.1 191 3050
## 4 Adelie Biscoe 41.6 18 192 3950
## # ℹ 2 more variables: sex <fct>, year <int>
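The examples above each use one condition, so here’s a sketch of a few ways to combine several. It runs on a hand-built tibble (penguins_mini, a hypothetical name) whose values are copied from the first six rows of the subset.

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name)
penguins_mini <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream"),
  bill_depth_mm = c(18.5, 13.9, 15.2, 15.9, 14.2, 19.9),
  body_mass_g = c(3875L, 4200L, 4650L, 5400L, 4700L, 4050L)
)

# AND: conditions separated by commas must all be true
heavy_gentoo <- penguins_mini %>%
  filter(species == "Gentoo", body_mass_g > 4500)

# OR: use | when either condition is enough
dream_or_deep <- penguins_mini %>%
  filter(island == "Dream" | bill_depth_mm > 18)

# %in%: match against a set of allowed values
not_gentoo <- penguins_mini %>%
  filter(species %in% c("Adelie", "Chinstrap"))

heavy_gentoo
dream_or_deep
not_gentoo
```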
Starting with select()
select()
allows us to pick which columns (variables) we want to look at, and we can use it to pull a subset of variables, or even rearrange the order of our variables within a dataframe.
# selecting species, flipper_length_mm, and sex columns
penguins_subset %>%
select(species, flipper_length_mm, sex)
## # A tibble: 12 × 3
## species flipper_length_mm sex
## <fct> <int> <fct>
## 1 Adelie 202 male
## 2 Gentoo 210 female
## 3 Gentoo 213 female
## 4 Gentoo 225 male
## 5 Gentoo 219 female
## 6 Chinstrap 203 male
## 7 Adelie 187 female
## 8 Chinstrap 191 female
## 9 Adelie 190 male
## 10 Adelie 191 female
## 11 Adelie 192 male
## 12 Gentoo 221 male
# selecting all character data (this returns zero columns, because our text columns are stored as factors)
penguins_subset %>%
select(where(is.character))
## # A tibble: 12 × 0
# selecting all numeric data
penguins_subset %>%
select(where(is.numeric))
## # A tibble: 12 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <dbl> <dbl> <int> <int> <int>
## 1 41.4 18.5 202 3875 2009
## 2 45.5 13.9 210 4200 2008
## 3 43.5 15.2 213 4650 2009
## 4 50.5 15.9 225 5400 2008
## 5 45.8 14.2 219 4700 2008
## 6 49.3 19.9 203 4050 2009
## 7 40.5 17.9 187 3200 2007
## 8 45.2 16.6 191 3250 2009
## 9 36.3 19.5 190 3800 2008
## 10 39 17.1 191 3050 2009
## 11 41.6 18 192 3950 2008
## 12 48.2 15.6 221 5100 2008
# selecting all non-numeric columns (here, the factors) by negating where(is.numeric)
penguins_subset %>%
select(!where(is.numeric))
## # A tibble: 12 × 3
## species island sex
## <fct> <fct> <fct>
## 1 Adelie Torgersen male
## 2 Gentoo Biscoe female
## 3 Gentoo Biscoe female
## 4 Gentoo Biscoe male
## 5 Gentoo Biscoe female
## 6 Chinstrap Dream male
## 7 Adelie Biscoe female
## 8 Chinstrap Dream female
## 9 Adelie Dream male
## 10 Adelie Torgersen female
## 11 Adelie Biscoe male
## 12 Gentoo Biscoe male
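Two more things worth trying, sketched here on a hand-built tibble (penguins_mini, a hypothetical name with values taken from the subset): selecting factor columns directly with where(is.factor), and using select() to change the column order by listing the columns in the order you want.

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name)
penguins_mini <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  island = factor(c("Torgersen", "Biscoe", "Dream")),
  body_mass_g = c(3875L, 4200L, 4050L),
  year = c(2009L, 2008L, 2009L)
)

# where(is.factor) picks up the columns that where(is.character) misses
factor_cols <- penguins_mini %>%
  select(where(is.factor))

# Listing columns in a new order rearranges the dataframe
reordered <- penguins_mini %>%
  select(year, species, island, body_mass_g)

names(factor_cols)
names(reordered)
```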
Math with mutate()
What’s not to love about a function that lets us create new columns (variables)?! For these examples we’ll work strictly with mutate(), but when you work on extending this notebook, try using group_by() and then mutate()! (We’ll talk about group_by() in the next section.)
# converting grams to pounds
# notice how the order of our columns stays the same, and the new column, body_weight_pounds, gets placed at the
# far right of the dataframe. what function could we use to change this order?
penguins_subset %>%
mutate(body_weight_pounds = body_mass_g / 453.59237)
## # A tibble: 12 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 41.4 18.5 202 3875
## 2 Gentoo Biscoe 45.5 13.9 210 4200
## 3 Gentoo Biscoe 43.5 15.2 213 4650
## 4 Gentoo Biscoe 50.5 15.9 225 5400
## 5 Gentoo Biscoe 45.8 14.2 219 4700
## 6 Chinstrap Dream 49.3 19.9 203 4050
## 7 Adelie Biscoe 40.5 17.9 187 3200
## 8 Chinstrap Dream 45.2 16.6 191 3250
## 9 Adelie Dream 36.3 19.5 190 3800
## 10 Adelie Torgers… 39 17.1 191 3050
## 11 Adelie Biscoe 41.6 18 192 3950
## 12 Gentoo Biscoe 48.2 15.6 221 5100
## # ℹ 3 more variables: sex <fct>, year <int>, body_weight_pounds <dbl>
# OK I wanted to show you how to combine select and mutate
# what do you think the everything() function might do? confirm your guess by looking at the documentation (linked at
# the end of the notebook).
penguins_subset %>%
mutate(body_weight_pounds = body_mass_g / 453.59237) %>%
select(species, body_mass_g, body_weight_pounds, everything())
## # A tibble: 12 × 9
## species body_mass_g body_weight_pounds island bill_length_mm bill_depth_mm
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Adelie 3875 8.54 Torger… 41.4 18.5
## 2 Gentoo 4200 9.26 Biscoe 45.5 13.9
## 3 Gentoo 4650 10.3 Biscoe 43.5 15.2
## 4 Gentoo 5400 11.9 Biscoe 50.5 15.9
## 5 Gentoo 4700 10.4 Biscoe 45.8 14.2
## 6 Chinstrap 4050 8.93 Dream 49.3 19.9
## 7 Adelie 3200 7.05 Biscoe 40.5 17.9
## 8 Chinstrap 3250 7.17 Dream 45.2 16.6
## 9 Adelie 3800 8.38 Dream 36.3 19.5
## 10 Adelie 3050 6.72 Torger… 39 17.1
## 11 Adelie 3950 8.71 Biscoe 41.6 18
## 12 Gentoo 5100 11.2 Biscoe 48.2 15.6
## # ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>
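If you’d like a head start on the group_by() and then mutate() combination, here’s one possible sketch. The tibble penguins_mini and its round-number body masses are made up so the arithmetic is easy to check by eye; they are not real measurements.

```r
library(dplyr)

# Made-up data with round numbers (hypothetical name and values)
penguins_mini <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3000, 4000, 5000, 5400)
)

# After group_by(), mutate() computes the mean within each species,
# but still returns one row per penguin
with_group_mean <- penguins_mini %>%
  group_by(species) %>%
  mutate(
    species_avg_mass = mean(body_mass_g),
    diff_from_avg = body_mass_g - species_avg_mass
  ) %>%
  ungroup()

with_group_mean
```

Unlike summarise(), which collapses each group down to a single row, a grouped mutate() keeps every row and adds the group-level value alongside it.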
Summaries with summarise(), with help from group_by()
You can use either summarise()
or summarize()
to get a summary, or overview, of your data. What’s more, once we introduce group_by()
you can get summary data for subsets of your data.
# summarising the average body mass of penguins, in grams
penguins_subset %>%
summarise(avg_body_mass = mean(body_mass_g))
## # A tibble: 1 × 1
## avg_body_mass
## <dbl>
## 1 4102.
# since we're now summarising our data we can go ahead and use the full dataframe, since the printout will be reasonably-sized
penguins %>%
summarise(avg_body_mass = mean(body_mass_g))
## # A tibble: 1 × 1
## avg_body_mass
## <dbl>
## 1 NA
The NAs!
If we don’t handle our NA
values we’re going to be in for a bad time. There are multiple ways of dealing with NA
values in R - for now we’re going to use na.rm = TRUE
, but you could use filter()
from the {dplyr}
package or drop_na()
from the {tidyr}
package as well!
na.rm is like asking the question, “Should we remove NAs from our calculation?” where na stands for NA values, and rm stands for remove. So when we set na.rm = TRUE we’re saying “Yes, please remove NA values from my calculations.” Likewise if we use na.rm = FALSE (the default for mean()) we’re telling R that we want to include NA values in our calculations, which is why the result above came back as NA.
And if you’re not sure, NA
stands for “Not Available,” meaning data that is missing.
# summarising body mass on the entire penguins dataset while removing NA values from the calculation
penguins %>%
summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 1
## avg_body_mass
## <dbl>
## 1 4202.
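For comparison, here’s a sketch of the three approaches mentioned above side by side. It uses a tiny hand-built tibble (penguins_mini, a hypothetical name with illustrative values) that has one missing body mass.

```r
library(dplyr)
library(tidyr)

# Hand-built data with one missing value (hypothetical name)
penguins_mini <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3750L, NA, 5400L)
)

# 1. Ignore NA values inside the calculation
via_na_rm <- penguins_mini %>%
  summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))

# 2. Remove the rows with dplyr::filter() before summarising
via_filter <- penguins_mini %>%
  filter(!is.na(body_mass_g)) %>%
  summarise(avg_body_mass = mean(body_mass_g))

# 3. Remove the rows with tidyr::drop_na() before summarising
via_drop_na <- penguins_mini %>%
  drop_na(body_mass_g) %>%
  summarise(avg_body_mass = mean(body_mass_g))

via_na_rm
via_filter
via_drop_na
```

All three give the same average here. The difference is that filter() and drop_na() remove whole rows, which matters if you go on to summarise other columns in the same chain.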
# now let's use the grouping function, group_by(), to look at the average body mass of penguins, in grams,
# by species
penguins %>%
group_by(species) %>%
summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 3 × 2
## species avg_species_body_mass
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
# now let's calculate the average body mass by species AND island
penguins %>%
group_by(species, island) %>%
summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: species [3]
## species island avg_species_body_mass
## <fct> <fct> <dbl>
## 1 Adelie Biscoe 3710.
## 2 Adelie Dream 3688.
## 3 Adelie Torgersen 3706.
## 4 Chinstrap Dream 3733.
## 5 Gentoo Biscoe 5076.
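That message about grouped output is only informational. If you’d rather get back an ungrouped tibble, one option is to set the .groups argument yourself, as in this sketch (penguins_mini is a hypothetical hand-built tibble with illustrative values).

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name and values)
penguins_mini <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Biscoe", "Dream", "Biscoe"),
  body_mass_g = c(3700, 3690, 5080)
)

# .groups = "drop" removes all grouping from the result,
# which also silences the message
by_both <- penguins_mini %>%
  group_by(species, island) %>%
  summarise(avg_species_body_mass = mean(body_mass_g), .groups = "drop")

# Check whether the result is still grouped
is_grouped_df(by_both)
```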
Where to next?
What we’ve done here only scratches the surface of what you can accomplish with {dplyr}. {dplyr} is a powerful package in its own right, but even more so once you dive into column-wise operations, like across(), and combine it with other packages in the Tidyverse, such as {purrr} and {ggplot2}.
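As a small taste of column-wise operations, here’s a sketch that uses across() to average every numeric column at once. The tibble penguins_mini is a hypothetical hand-built example with illustrative values, including one NA.

```r
library(dplyr)

# Hand-built stand-in for the penguins data (hypothetical name and values)
penguins_mini <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(39.1, NA, 50.5),
  body_mass_g = c(3750, 3800, 5400)
)

# across() applies the same function to every selected column;
# where(is.numeric) selects the numeric ones
avg_by_species <- penguins_mini %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

avg_by_species
```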
What I recommend is making a copy of this notebook and running the cells to ensure you understand what’s happening with each function, and then building out additional chains of {dplyr} functions to see what you can discover and learn! Play around and don’t worry about making mistakes - it’s all part of learning!
These are some helpful resources to get you started:
- {dplyr} documentation - so many functions!
- R for Data Science text
- STAT545
- More on column-wise operations