library(tidyverse)
library(nycflights23)
Data wrangling with dplyr
Required packages
All packages are automatically loaded in this tutorial.
Tibble vs. data frame
The tidyverse provides the tibble, a data structure that improves on ordinary data frames:
- Better performance (column layout).
- More “type safety” and less magic.
- Better support for common interactive tasks.
Data types
Print a tibble
Print a data frame
Print both the entire df and flights (simply call df and flights) to see how the output differs.
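A minimal sketch of the comparison, assuming that df is simply a plain data frame copy of flights (the original definition of df is not shown here). It uses head() for the data frame so that the console is not flooded.

```r
library(dplyr)
library(nycflights23)

# Assumption: df is a plain data frame copy of the flights tibble
df <- as.data.frame(flights)

class(flights)  # includes "tbl_df" in addition to "data.frame"
class(df)       # "data.frame"

# A tibble prints only the first rows and the columns that fit the console
flights

# A data frame prints as many rows as max.print allows, so limit it here
head(df)

# Single-column subsetting: the tibble stays a tibble,
# the data frame silently drops to a plain vector
dest_tbl <- flights[, "dest"]
dest_df  <- df[, "dest"]
class(dest_tbl)
class(dest_df)
```

The last two lines illustrate the "less magic" point: the return type of a tibble subset does not depend on how many columns were selected.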
dplyr vs. base R
Slicing the flights data set with base R operators.
The basic indexing/slicing operators [], [[]], and $ are useful for simple tasks but may return unexpected results and quickly become hard to read.
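As an illustration, here is one possible base R slice. The condition (flights to LAX that left more than 60 minutes late) and the chosen columns are assumptions made for this sketch; the column names come from the flights table in nycflights23.

```r
library(dplyr)        # loaded so that tibble printing and subsetting methods are available
library(nycflights23)

# Rows: flights to LAX that departed more than 60 minutes late.
# which() discards comparisons that evaluate to NA (missing dep_delay);
# without it, those rows would come back as rows filled with NA.
late_lax <- flights[which(flights$dest == "LAX" & flights$dep_delay > 60),
                    c("carrier", "flight", "dep_delay")]
head(late_lax)

# The three operators return different types
one_col <- flights["dep_delay"]     # a tibble with a single column
as_vec  <- flights[["dep_delay"]]   # a plain vector
via_dlr <- flights$dep_delay        # the same plain vector
```

Note how often flights has to be repeated and how the NA handling must be added by hand.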
Using dplyr operations and pipes (|>) for readability.
Note that the line breaks are not required; they only improve readability.
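A sketch of a dplyr pipeline, using an assumed example condition (flights to LAX that left more than 60 minutes late). It is written once with line breaks and once on a single line to show that both forms are equivalent.

```r
library(dplyr)
library(nycflights23)

# One step per line; filter() drops rows where the condition is NA
late_lax <- flights |>
  filter(dest == "LAX", dep_delay > 60) |>
  select(carrier, flight, dep_delay)

# The same pipeline without line breaks
late_lax_one_line <- flights |> filter(dest == "LAX", dep_delay > 60) |> select(carrier, flight, dep_delay)

late_lax
```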
dplyr functions
- Functions are verbs: select, filter, mutate, group_by, summarize, etc.
- Consistently uses '_' as separator (e.g., group_by)
- Non-standard evaluation (no need for quotes and qualifiers)
count
count is a very convenient and concise way of counting the number of entries in different slices of a data set. If applied to a grouped tibble (see below), the arguments (grouping variables) can be omitted.
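A short sketch of typical count calls. The choice of origin and carrier as counting variables and the column name n_flights are assumptions for illustration.

```r
library(dplyr)
library(nycflights23)

# Number of flights per origin airport (the count column is called n)
per_origin <- flights |> count(origin)

# Several variables at once, sorted by count, with a custom column name
per_origin_carrier <- flights |>
  count(origin, carrier, sort = TRUE, name = "n_flights")

# On a grouped tibble the grouping variables need not be repeated
per_origin_grouped <- flights |>
  group_by(origin) |>
  count()

per_origin
```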
select
Select specific columns (include):
Select specific columns (exclude):
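Both variants might look as follows; the particular columns are assumptions picked from the flights table.

```r
library(dplyr)
library(nycflights23)

# Include: keep only the listed columns, in the listed order
included <- flights |> select(year, month, day, carrier, dep_delay)

# Exclude: keep everything except the listed columns
excluded <- flights |> select(-year, -time_hour)

names(included)
names(excluded)
```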
filter
Filter rows with predicates:
Filter rows with %in%:
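A sketch of both filter styles, with assumed example conditions (long arrival delays out of JFK, and a handful of West Coast destinations).

```r
library(dplyr)
library(nycflights23)

# Predicates: comma-separated conditions are combined with AND
jfk_late <- flights |> filter(origin == "JFK", arr_delay > 120)

# %in%: keep rows whose value matches any element of a vector
west_coast <- flights |> filter(dest %in% c("LAX", "SFO", "SEA"))

jfk_late
west_coast
```

Rows for which a predicate evaluates to NA are dropped, so jfk_late contains no flights with a missing arr_delay.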
mutate
Add a column:
Change a column:
Depending on the overall goal, mutate or select may result in a more concise statement. For example, if the goal is to change a single column but keep all columns in the data set, then using mutate is likely more concise; if the goal is to keep only a few columns and rename some of them, then select alone may suffice, because select can rename the columns it keeps (it cannot compute new values).
Change a column type:
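The three uses of mutate named above could be sketched like this. The derived column gain, the unit conversion, and the chosen type conversions are assumptions for illustration only.

```r
library(dplyr)
library(nycflights23)

# Add a column: time made up in the air (hypothetical column name gain)
with_gain <- flights |> mutate(gain = dep_delay - arr_delay)

# Change a column: overwrite distance (miles) with kilometres
in_km <- flights |> mutate(distance = distance * 1.609344)

# Change a column type: carrier becomes a factor, flight a character vector
retyped <- flights |>
  mutate(carrier = as.factor(carrier), flight = as.character(flight))

with_gain |> select(dep_delay, arr_delay, gain)
```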
rename
rename_all
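A brief sketch of both functions; the new column name and the use of toupper are assumptions. In current dplyr releases rename_all is superseded by rename_with, so the sketch shows both.

```r
library(dplyr)
library(nycflights23)

# rename: new_name = old_name
renamed <- flights |> rename(departure_delay = dep_delay)

# rename_all: apply a function to every column name
upper_all <- flights |> rename_all(toupper)

# rename_with: the current replacement; .cols restricts which names change
upper_with <- flights |> rename_with(toupper)
upper_dep  <- flights |> rename_with(toupper, .cols = starts_with("dep"))

names(upper_dep)
```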
relocate
By default, relocate moves columns to the left-hand side. The tidyverse provides a DSL with helper functions for selecting variables (columns). Using these helpers (e.g., everything(), last_col(), etc.) avoids long lists of column names and better expresses intent.
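A sketch of relocate with and without selection helpers; the columns being moved are assumptions chosen for illustration.

```r
library(dplyr)
library(nycflights23)

# Default: the named columns move to the front
front <- flights |> relocate(carrier, flight)

# Move the date columns to the end using last_col()
back <- flights |> relocate(year, month, day, .after = last_col())

# Helpers select by name pattern: put both delay columns before dep_time
delays_first <- flights |> relocate(ends_with("delay"), .before = dep_time)

names(front)
names(back)
```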
group_by and n
Downstream operations are applied to each group, but group_by still returns a single (grouped) tibble! In this case, n(), which returns the number of rows, is applied to each group, but all groups are concatenated in the returned (grouped) tibble.
Note that the resulting tibble still contains all rows. A common use case for group_by and n is to add a normalization variable or provide additional information for plotting all data points. If we were only interested in one row per group, we would use count or summarize instead.
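The behaviour described above can be sketched as follows. Grouping by origin and the column names n_origin and weight are assumptions made for this example.

```r
library(dplyr)
library(nycflights23)

flights_n <- flights |>
  group_by(origin) |>
  mutate(
    n_origin = n(),        # number of rows in the current group
    weight = 1 / n_origin  # hypothetical normalization variable
  )

# Every original row is still present, now with its group size attached
flights_n |> select(origin, carrier, n_origin, weight)

# One row per group instead
flights |> count(origin)
```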
The type of a grouped tibble:
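One way to inspect it, assuming a grouping by origin and carrier:

```r
library(dplyr)
library(nycflights23)

grouped <- flights |> group_by(origin, carrier)

class(grouped)          # "grouped_df" is listed first, followed by the tibble classes
group_vars(grouped)     # the current grouping variables
is_grouped_df(grouped)  # TRUE
```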
ungroup
Make sure to ungroup a tibble if it is intended to be used in its entirety later in your script. For example, it is common to group and aggregate data for plotting. Once all relevant information has been added to the tibble, calling ungroup ensures that any additional grouping in the plot has to be declared explicitly.
A new call to group_by will change the grouping, so there is no need to call ungroup first in a sequence of multiple data groupings.
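A sketch of regrouping followed by a final ungroup; the grouping variables and the column names n_origin and n_carrier are assumptions.

```r
library(dplyr)
library(nycflights23)

by_origin <- flights |>
  group_by(origin) |>
  mutate(n_origin = n())

# A second group_by replaces the earlier grouping; no ungroup needed in between
by_carrier <- by_origin |>
  group_by(carrier) |>
  mutate(n_carrier = n())

group_vars(by_carrier)    # only "carrier"

# Remove the grouping before the tibble is reused as a whole
flights_all <- by_carrier |> ungroup()
group_vars(flights_all)   # character(0)
```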
summarize
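By default, the mean, median, etc. of a vector that includes NA values is NA.
Use na.rm = TRUE to drop NA values. (The shorthand na.rm = T also works, but T is an ordinary variable that can be reassigned, so TRUE is safer.)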
We will discuss this and other approaches to handling NA values in detail later.
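A sketch of summarize with and without na.rm. Grouping by carrier and the summary column names are assumptions for illustration.

```r
library(dplyr)
library(nycflights23)

# Small vector to show the default behaviour
raw_mean   <- mean(c(1, NA, 3))                # NA
clean_mean <- mean(c(1, NA, 3), na.rm = TRUE)  # 2

delays <- flights |>
  group_by(carrier) |>
  summarize(
    n_flights      = n(),
    mean_delay_raw = mean(arr_delay),                # NA if the group has any NA
    mean_delay     = mean(arr_delay, na.rm = TRUE),  # NA values dropped first
    .groups = "drop"
  )

delays
```

The result has one row per carrier, and .groups = "drop" returns an ungrouped tibble.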
arrange
Order data by column:
arrange orders in ascending order by default.
Use arrange(desc(n_flights)) to order in descending order.
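A sketch of both orderings. It assumes that n_flights is a per-destination flight count, which is recreated here with count because its original definition is not shown.

```r
library(dplyr)
library(nycflights23)

# Assumption: n_flights is the number of flights per destination
per_dest <- flights |> count(dest, name = "n_flights")

ascending  <- per_dest |> arrange(n_flights)        # smallest first (default)
descending <- per_dest |> arrange(desc(n_flights))  # largest first

# Additional columns break ties
tie_broken <- per_dest |> arrange(desc(n_flights), dest)

descending
```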
left_join
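A minimal sketch of a left join, assuming the goal is to attach full airline names from the airlines table (also part of nycflights23) to each flight via the shared carrier column.

```r
library(dplyr)
library(nycflights23)

flights_named <- flights |>
  select(carrier, flight, origin, dest) |>
  left_join(airlines, by = "carrier")  # adds the name column from airlines

flights_named
```

left_join keeps every row of the left-hand table (flights); a carrier without a match in airlines would get NA in the added column.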
Questions
Q1 State three questions/hypotheses that you can answer/test with the nycflights23 data set. Answering each question should require some data-slicing operations. (For example, answering “how many flights are there in total” or “what is the average arrival delay across all flights” is not sufficient.)
Q2 Answer each of your three questions by analyzing the nycflights23 data set. Focus on the following dplyr functions: count, select, filter, group_by, and summarize. You may use the interactive code block below or a setup of your choice.
Q3 (Optional) Provide a ggplot2 visualization for at least two of your questions.
Q4 (Optional, going deep) Explore the difference in semantics for the two pipes |> and %>%. (Note that |> is the pipe natively supported in modern versions of R. The %>% pipe is provided by the magrittr package and predates |>. By default, you should use the native pipe |>, but you will encounter many resources that still use %>%.)