Data wrangling with dplyr

Required packages

Note

All packages are automatically loaded in this tutorial.

library(tidyverse)
library(nycflights23)

Tibble vs. data frame

The tidyverse provides the tibble, an improved version of the ordinary data frame:

  • Better performance (column layout).

  • More “type safety” and less magic.

  • Better support for common interactive tasks.
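
As a minimal sketch of the "less magic" point, the following compares a data frame and a tibble built from the same made-up values (the two rows are invented for illustration; only the column names mirror the flights data).

```r
library(tibble)

# Made-up two-row example; column names mirror the flights data
df <- data.frame(carrier = c("AA", "DL"), dep_delay = c(5, -2))
tb <- as_tibble(df)

# Selecting one column with [ : the data frame drops to a bare vector,
# the tibble stays a tibble
df_col <- df[, "dep_delay"]
tb_col <- tb[, "dep_delay"]
class(df_col)   # "numeric"
class(tb_col)   # "tbl_df" "tbl" "data.frame"

# Partial name matching with $ : the data frame silently matches dep_delay,
# the tibble returns NULL and warns (suppressWarnings() hides the warning here)
df_partial <- df$dep
tb_partial <- suppressWarnings(tb$dep)
```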

Data types

dplyr vs. base R

Slicing the flights data set with base R operators.

R’s indexing/slicing operators

The basic indexing/slicing operators [], [[]], and $ are useful for simple tasks but may return unexpected results and quickly become hard to read.
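
A small sketch of how base R slicing can surprise, using flights_toy, a hypothetical stand-in for the flights data with made-up values. An NA in the logical condition produces a row of NAs instead of being dropped.

```r
# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- data.frame(
  carrier   = c("AA", "DL", "UA", "UA"),
  origin    = c("JFK", "JFK", "EWR", "JFK"),
  dep_delay = c(5, NA, 30, 12)
)

# JFK departures delayed by more than 10 minutes, keeping two columns.
# The table name has to be repeated for every column reference.
late_jfk <- flights_toy[
  flights_toy$origin == "JFK" & flights_toy$dep_delay > 10,
  c("carrier", "dep_delay")
]

# The DL row has an NA delay, so its condition is NA and it comes back
# as a row filled with NA, followed by the one row that actually matches
late_jfk

# [[ ]] and $ both extract a single column as a vector
delays_a <- flights_toy[["dep_delay"]]
delays_b <- flights_toy$dep_delay
```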

Using dplyr operations and pipes (|>) for readability.

Optimize for readability: code is written once but read many times!

Note that the line breaks are not required; they only improve readability.
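
A sketch of the same kind of slice written as a dplyr pipeline, assuming a made-up stand-in table called flights_toy. Each step reads as one verb, and filter drops rows whose condition is NA.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "DL", "UA", "UA"),
  origin    = c("JFK", "JFK", "EWR", "JFK"),
  dep_delay = c(5, NA, 30, 12)
)

# JFK departures delayed by more than 10 minutes, keeping two columns
late_jfk <- flights_toy |>
  filter(origin == "JFK", dep_delay > 10) |>
  select(carrier, dep_delay)

# The same pipeline on one line; the line breaks above are only for readability
late_jfk_one_line <- flights_toy |> filter(origin == "JFK", dep_delay > 10) |> select(carrier, dep_delay)
```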

dplyr functions

  • Functions are verbs
    • select
    • filter
    • mutate
    • group_by
    • summarize
    • etc.
  • Consistently uses an underscore (_) as separator (e.g., group_by)
  • Non-standard evaluation (column names are written without quotes and without a qualifier such as flights$)

count

Note

count is a very convenient and concise way of counting the number of entries in different slices of a data set. If applied to a grouped tibble (see below), the arguments (grouping variables) can be omitted.
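
A minimal sketch of count on a made-up stand-in table (flights_toy is hypothetical; only the column names mirror the flights data).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier = c("AA", "AA", "DL", "UA", "UA", "UA"),
  origin  = c("JFK", "LGA", "JFK", "EWR", "EWR", "JFK")
)

# Number of rows per origin; the count is stored in a column named n
by_origin <- flights_toy |> count(origin)

# Several variables at once, with the largest group first
by_pair <- flights_toy |> count(origin, carrier, sort = TRUE)

# On a grouped tibble the grouping variables do not need to be repeated
by_origin_grouped <- flights_toy |> group_by(origin) |> count()
```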

select

Select specific columns (include):

Select specific columns (exclude):
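
Both forms can be sketched on a made-up stand-in table (flights_toy is hypothetical): naming columns keeps them, and a leading minus sign drops them.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "DL", "UA"),
  origin    = c("JFK", "JFK", "EWR"),
  dep_delay = c(5, -2, 30),
  air_time  = c(120, 200, 60)
)

# Include: keep only the named columns, in the order given
kept <- flights_toy |> select(carrier, dep_delay)

# Exclude: keep everything except the named columns
dropped <- flights_toy |> select(-air_time, -origin)
```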

filter

Filter rows with predicates:

Filter rows with %in%:
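
A short sketch of both filter styles on a made-up stand-in table (flights_toy is hypothetical). Predicates separated by commas are combined with a logical AND.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "AA", "DL", "UA", "UA", "UA"),
  origin    = c("JFK", "LGA", "JFK", "EWR", "EWR", "JFK"),
  dep_delay = c(5, NA, -2, 30, 12, 0)
)

# Predicates: JFK departures that left late
late_jfk <- flights_toy |> filter(origin == "JFK", dep_delay > 0)

# %in%: rows whose carrier is one of several values
aa_or_dl <- flights_toy |> filter(carrier %in% c("AA", "DL"))
```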

mutate

Add a column:

Change a column:

Note

Depending on the overall goal, mutate or select may result in a more concise statement. For example, if the goal is to change a single column but keep all columns in the data set, then mutate is likely more concise; if the goal is to keep only a few columns and rename some of them, then select alone may suffice, since select can rename columns but cannot compute new values.

Change a column type:
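
The three uses of mutate listed above can be sketched on a made-up stand-in table (flights_toy and the new column name air_time_h are hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier  = c("AA", "DL", "UA"),
  origin   = c("JFK", "JFK", "EWR"),
  air_time = c(120, 90, 300)          # minutes
)

# Add a column: air time in hours next to the original column
added <- flights_toy |> mutate(air_time_h = air_time / 60)

# Change a column: assigning to an existing name overwrites it
changed <- flights_toy |> mutate(air_time = air_time / 60)

# Change a column type: convert origin from character to factor
typed <- flights_toy |> mutate(origin = as.factor(origin))
```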

rename
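
A minimal sketch on a made-up stand-in table (flights_toy and the new name airline are hypothetical). rename takes pairs of the form new_name = old_name and keeps all other columns unchanged.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier = c("AA", "DL"),
  origin  = c("JFK", "EWR")
)

# new_name = old_name
renamed <- flights_toy |> rename(airline = carrier)
```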

rename_all
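
rename_all applies one function to every column name. In current dplyr releases it is marked as superseded by rename_with, so the sketch below shows both on a made-up stand-in table (flights_toy is hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier = c("AA", "DL"),
  origin  = c("JFK", "EWR")
)

# Apply toupper() to every column name
upper_all  <- flights_toy |> rename_all(toupper)

# The newer equivalent; by default it also targets all columns
upper_with <- flights_toy |> rename_with(toupper)
```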

relocate

Selection helpers

By default, relocate moves columns to the left-hand side (the front of the tibble). The tidyverse provides a small domain-specific language (DSL) with helper functions for selecting variables (columns). Using these helpers (e.g., everything(), last_col(), etc.) avoids long lists of column names and better expresses intent.
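
A short sketch of relocate and two selection helpers on a made-up stand-in table (flights_toy is hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "DL"),
  origin    = c("JFK", "EWR"),
  dep_delay = c(5, -2),
  air_time  = c(120, 200)
)

# Default: the named column moves to the front
front <- flights_toy |> relocate(dep_delay)

# Move a column behind the last one
back <- flights_toy |> relocate(carrier, .after = last_col())

# Helpers also work in select: origin first, then all remaining columns
reordered <- flights_toy |> select(origin, everything())
```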

group_by and n

A grouped tibble affects downstream operations

Downstream operations are applied to each group, but group_by still returns a single (grouped) tibble! In this case, n(), which returns the number of rows, is applied to each group, but all groups are concatenated in the returned (grouped) tibble.

Note that the resulting tibble still contains all rows. A common use case for group_by and n is to add a normalization variable or provide additional information for plotting all data points. If we were only interested in unique entries, we would use count or summarize instead.
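
A minimal sketch of this pattern on a made-up stand-in table (flights_toy and the column name n_origin are hypothetical). Every row is kept, and each row receives the size of its own group.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier = c("AA", "AA", "DL", "UA", "UA", "UA"),
  origin  = c("JFK", "LGA", "JFK", "EWR", "EWR", "JFK")
)

# n() is evaluated once per group, and the result is attached to every row
per_flight <- flights_toy |>
  group_by(origin) |>
  mutate(n_origin = n())

nrow(per_flight)   # still 6 rows, one per flight
```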

The type of a grouped tibble:
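
The class vector can be inspected directly; the sketch below uses a made-up stand-in table (flights_toy is hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  origin    = c("JFK", "EWR", "JFK"),
  dep_delay = c(5, 30, 0)
)

by_origin <- flights_toy |> group_by(origin)

# A grouped tibble is a tibble with an extra leading class
cls <- class(by_origin)        # "grouped_df" "tbl_df" "tbl" "data.frame"

# The grouping variables are stored with the tibble
vars <- group_vars(by_origin)  # "origin"
```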

ungroup

Note

Make sure to ungroup a tibble if it is intended to be used in its entirety later in your script. For example, it is common to group and aggregate data for plotting. Once all relevant information has been added to the tibble, calling ungroup will make sure that any additional grouping in the plot has to be explicitly declared.

A new call to group_by will change the grouping, so there is no need to call ungroup first in a sequence of multiple data groupings.
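
Both behaviors can be sketched on a made-up stand-in table (flights_toy is hypothetical): a second group_by replaces the grouping, and ungroup removes it.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier = c("AA", "DL", "UA", "UA"),
  origin  = c("JFK", "JFK", "EWR", "JFK")
)

grouped <- flights_toy |>
  group_by(origin) |>
  mutate(n_origin = n())

# A new group_by replaces the previous grouping
regrouped <- grouped |> group_by(carrier)
regrouped_vars <- group_vars(regrouped)   # "carrier"

# ungroup returns a plain tibble; later steps see the whole table again
plain <- grouped |> ungroup()
plain_is_grouped <- is_grouped_df(plain)  # FALSE
```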

summarize

Handling of NA values

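By default, the mean, median, etc. of a vector that includes NA values is NA.

Use na.rm = TRUE to drop NA values before the computation. (The shorthand na.rm = T also works, but T is an ordinary variable that can be reassigned, so TRUE is safer.)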

We will discuss this and other approaches to handling NA values in detail later.
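
A minimal sketch of summarize with and without na.rm, on a made-up stand-in table (flights_toy and the summary column names are hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for the flights data (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "AA", "DL", "UA", "UA", "UA"),
  dep_delay = c(5, NA, -2, 30, 12, 0)
)

# One output row per carrier
by_carrier <- flights_toy |>
  group_by(carrier) |>
  summarize(
    n_flights      = n(),
    mean_delay_raw = mean(dep_delay),               # NA for AA, which has a missing delay
    mean_delay     = mean(dep_delay, na.rm = TRUE)  # missing delays dropped first
  )
```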

arrange

Order data by column:

arrange sorts in ascending order by default.

Use arrange(desc(n_flights)) to sort in descending order.
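
Both orderings can be sketched on a small made-up summary table (the counts are invented; n_flights is assumed to be a per-carrier count).

```r
library(dplyr)

# Made-up per-carrier counts
counts <- tibble(
  carrier   = c("AA", "DL", "UA"),
  n_flights = c(2, 1, 3)
)

# Ascending (default)
ascending <- counts |> arrange(n_flights)

# Descending
descending <- counts |> arrange(desc(n_flights))
```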

left_join
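
left_join keeps every row of the left table and adds the matching columns from the right table; rows without a match get NA in the added columns. The sketch below uses two made-up stand-in tables (flights_toy and airlines_toy are hypothetical; the column names carrier and name mirror the flights and airlines tables).

```r
library(dplyr)

# Hypothetical stand-ins (made-up values)
flights_toy <- tibble(
  carrier   = c("AA", "DL", "UA", "UA"),
  dep_delay = c(5, -2, 30, 12)
)
airlines_toy <- tibble(
  carrier = c("AA", "DL"),
  name    = c("American Airlines Inc.", "Delta Air Lines Inc.")
)

# Match rows on the shared carrier column
joined <- flights_toy |> left_join(airlines_toy, by = "carrier")

# All four flights are kept; UA has no match, so its name is NA
joined
```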

Questions

  • Q1 State three questions/hypotheses that you can answer/test with the nycflights23 data set. Answering each question should require some data-slicing operations. (For example, answering “how many flights are there in total” or “what is the average arrival delay across all flights” is not sufficient.)

  • Q2 Answer each of your three questions by analyzing the nycflights23 data set. Focus on the following dplyr functions: count, select, filter, group_by, and summarize. You may use the interactive code block below or a setup of your choice.

  • Q3 (Optional) Provide a ggplot2 visualization for at least two of your questions.

  • Q4 (Optional, going deep) Explore the difference in semantics for the two pipes |> and %>%. (Note that |> is the pipe natively supported in modern versions of R. The %>% pipe is provided by the magrittr package and predates |>. By default, you should use the native pipe |>, but you will encounter many resources that still use %>%.)

Interactive environment for Q2