CSEP 590
Building Data Analysis Pipelines

Fall 2024

Data wrangling

Tibble vs. data frame

  • Better performance (column layout)
  • More “type safety” and less magic
  • Better support for common interactive tasks

mtcars is a built-in data set in R

Tibble vs. data frame

Printing a tibble to the console shows only the first 10 rows (and only the columns that fit on screen)

Print the entire df and tibble to see how they are rendered differently.
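A minimal sketch of this comparison, using the built-in mtcars data. The variable names df and tb, and the "model" column name for the former row names, are illustrative choices.

```r
library(tibble)

df <- mtcars                                  # plain data frame with row names
tb <- as_tibble(mtcars, rownames = "model")   # row names become a regular column

class(df)
class(tb)

print(tb)            # first 10 rows, with a note about the rows not shown
print(tb, n = Inf)   # every row of the tibble
print(df)            # a data frame prints every row
```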

The tidyverse packages

  • Functions are verbs
    • select
    • filter
    • mutate
    • group_by
    • summarize
    • etc.
  • Consistently uses '_' as separator (e.g., group_by)
  • Non-standard evaluation (column names need no quotes and no df$ qualifier)

Base R

Slicing the built-in mtcars data set.
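One way to slice mtcars in base R, as a sketch: keep the rows with a manual transmission (am == 1) and three columns. The variable name manual is illustrative.

```r
# Logical row index plus a character vector of column names
manual <- mtcars[mtcars$am == 1, c("mpg", "cyl", "gear")]

# First five rows of the slice
head(manual, n = 5)
```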

R’s indexing operators

The basic indexing operators [], [[]], and $ are useful for simple tasks but may return unexpected results and quickly become hard to read.
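A short sketch of two results that tend to surprise people. The variable names are illustrative.

```r
# Selecting one column with [ , ] silently drops to a vector
one_col_vec <- mtcars[, "mpg"]
is.data.frame(one_col_vec)

# drop = FALSE keeps the data frame structure
one_col_df <- mtcars[, "mpg", drop = FALSE]
is.data.frame(one_col_df)

# $ partially matches column names on a data frame:
# "mp" is not a column, yet this returns the mpg column
partial <- mtcars$mp
```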

Using dplyr functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With improved code formatting:

head(
  select(
    filter(
      mtcars,
      am == 1
    ),
    mpg,
    cyl,
    gear
  ),
  n = 5
)

Why is this still hard to read?

Using dplyr functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With local variables to improve readability:
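A sketch of the same query with one local variable per step. The names manual_cars, selected, and result are illustrative.

```r
library(dplyr)

manual_cars <- filter(mtcars, am == 1)            # rows: manual transmission
selected    <- select(manual_cars, mpg, cyl, gear) # columns of interest
result      <- head(selected, n = 5)               # first five rows

result
```

Each step has to be handed the right intermediate variable, so a stale or mistyped name can go unnoticed.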

Local variables improve readability, but this approach is prone to errors.

Using dplyr functions and pipes

Explicit indexing operations with pipes:
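A sketch of the same query written with the %>% pipe that dplyr provides. The native |> pipe (R 4.1 and later) would read the same way here.

```r
library(dplyr)

result <- mtcars %>%
  filter(am == 1) %>%          # rows: manual transmission
  select(mpg, cyl, gear) %>%   # columns of interest
  head(n = 5)                  # first five rows

result
```

The steps now read top to bottom in the order they run, with no intermediate names to keep track of.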

Optimize for readability: code is written once but read many times!

select

Select specific columns (include):
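For example, a minimal sketch on mtcars (the variable name cols is illustrative):

```r
library(dplyr)

# Keep only the named columns, in the order given
cols <- mtcars %>%
  select(mpg, cyl, gear)

head(cols)
```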

select

Select specific columns (exclude):
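For example, a minimal sketch on mtcars (the variable name without is illustrative):

```r
library(dplyr)

# A leading minus drops a column and keeps everything else
without <- mtcars %>%
  select(-mpg, -cyl)

names(without)
```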

filter

Filter rows:
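For example, a minimal sketch on mtcars. The threshold of 25 mpg and the variable name are arbitrary choices for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
efficient_manual <- mtcars %>%
  filter(am == 1, mpg > 25)

efficient_manual
```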

filter

Filter rows with %in%:
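For example, a minimal sketch on mtcars (the variable name four_or_six is illustrative):

```r
library(dplyr)

# %in% tests membership in a set of values,
# replacing a chain of cyl == 4 | cyl == 6
four_or_six <- mtcars %>%
  filter(cyl %in% c(4, 6))

table(four_or_six$cyl)
```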

mutate

Add a column:
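For example, a minimal sketch on mtcars. The new column kpl (kilometers per liter) is an illustrative choice.

```r
library(dplyr)

# New columns are computed from existing ones, row by row
with_kpl <- mtcars %>%
  mutate(kpl = mpg * 0.4251)   # 1 mile per US gallon is about 0.4251 km per liter

head(with_kpl)
```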

mutate

Change a column type:
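For example, a minimal sketch on mtcars. Which columns to convert, and to which types, are illustrative choices.

```r
library(dplyr)

# Assigning to an existing column name overwrites that column
typed <- mtcars %>%
  mutate(
    cyl = as.integer(cyl),   # double -> integer
    am  = as.logical(am)     # 0/1 -> FALSE/TRUE
  )

str(typed[, c("cyl", "am")])
```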

mutate and str_replace

Change column values (replace with regex):

We use the stringr package here

See str_replace, str_replace_all, and str_replace_na.
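A minimal sketch with a small, made-up tibble. The column name model and its values are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical example data
cars <- tibble(model = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"))

# str_replace changes only the FIRST match in each string
cleaned <- cars %>%
  mutate(model = str_replace(model, "\\s+", "_"))

cleaned
```

Only the first run of whitespace in each value is replaced. str_replace_all would replace every match.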

mutate and str_replace_all

Change column values (replace with regex):

We use the stringr package here

Use c("<pattern>"="<new value>", ...) in str_replace_all.
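A minimal sketch with a small, made-up tibble. The column name model, its values, and the two patterns are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical example data
cars <- tibble(model = c("Mazda RX4", "Merc 240D", "Datsun 710"))

# A named vector maps each pattern to its replacement;
# the pairs are applied one after another to every string
cleaned <- cars %>%
  mutate(model = str_replace_all(
    model,
    c("^Merc\\b" = "Mercedes", "\\s+" = "_")
  ))

cleaned
```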

rename_all
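A minimal sketch: rename_all applies one function to every column name. Using toupper is an illustrative choice.

```r
library(dplyr)

# Apply toupper to every column name
upper <- mtcars %>%
  rename_all(toupper)

names(upper)

# In dplyr 1.0.0 and later, rename_with(toupper) is the newer equivalent
upper2 <- mtcars %>%
  rename_with(toupper)
```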

group_by

A grouped tibble affects downstream operations

Operations are applied to each group.
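A minimal sketch on mtcars, grouping by cyl. The column name group_mean_mpg is illustrative.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl)

# The data are unchanged; only the grouping metadata is added
is_grouped_df(by_cyl)
n_groups(by_cyl)

# mean(mpg) is now computed once per cyl group, not over all rows
with_group_mean <- by_cyl %>%
  mutate(group_mean_mpg = mean(mpg))
```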

group_by and n
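A minimal sketch on mtcars: n() returns the number of rows in the current group. The column name n_cars is illustrative.

```r
library(dplyr)

# One output row per cyl value, with the group size
counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n())

counts
```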

group_by, n, and ungroup
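A minimal sketch on mtcars: attach the group size to every row, then remove the grouping so later steps work on the whole table again. The column name n_cars is illustrative.

```r
library(dplyr)

with_counts <- mtcars %>%
  group_by(cyl) %>%
  mutate(n_cars = n()) %>%   # group size repeated on each row of the group
  ungroup()                  # drop the grouping for downstream operations

is_grouped_df(with_counts)
```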

summarize
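A minimal sketch on mtcars: summarize collapses each group to a single row. The summary columns chosen here are illustrative.

```r
library(dplyr)

summary_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),   # average fuel economy per group
    max_hp   = max(hp),     # highest horsepower per group
    n_cars   = n()          # group size
  )

summary_by_cyl
```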

Handling of NA values

By default, the mean, median, etc. of a vector that includes NA values is NA.

Use na.rm=TRUE to drop NA values before computing. Avoid the shorthand na.rm=T: T is an ordinary variable that can be reassigned.
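A minimal sketch with a made-up vector and tibble. The values and column names are assumptions for illustration.

```r
library(dplyr)

x <- c(1, 2, NA, 4)

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # mean of the three non-missing values

# The same applies inside summarize
stats <- tibble(value = x) %>%
  summarize(
    mean_raw   = mean(value),
    mean_clean = mean(value, na.rm = TRUE)
  )
```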

arrange

Order data by column:

arrange sorts rows in ascending order by default

Use arrange(desc(n_flights)) to order in descending order.
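A minimal sketch with a small, made-up tibble. The carrier codes and n_flights values are invented for illustration.

```r
library(dplyr)

# Hypothetical per-carrier counts
flights_per_carrier <- tibble(
  carrier   = c("AA", "DL", "UA"),
  n_flights = c(120, 95, 210)
)

asc <- flights_per_carrier %>% arrange(n_flights)         # smallest first
dsc <- flights_per_carrier %>% arrange(desc(n_flights))   # largest first
```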

left_join
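A minimal sketch with two small, made-up tibbles. All names and values are assumptions for illustration. left_join keeps every row of the left table and fills unmatched columns with NA.

```r
library(dplyr)

# Hypothetical tables sharing the key column cyl
cars   <- tibble(model = c("A", "B", "C"), cyl = c(4, 6, 8))
labels <- tibble(cyl = c(4, 6), label = c("four", "six"))

# Every row of cars is kept; cyl == 8 has no match, so its label is NA
joined <- left_join(cars, labels, by = "cyl")

joined
```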