CSEP 590 Demos – CSEP 590 Building Data Analysis Pipelines

CSEP 590
Building Data Analysis Pipelines

Fall 2024

Data wrangling

`Tibble` vs. data frame

Better performance (column layout)
More “type safety” and less magic
Better support for common interactive tasks

mtcars is a built-in data set in R
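A minimal sketch of turning the built-in data frame into a tibble, assuming the tibble package is installed. The column name `model` is chosen here for illustration only.

```r
library(tibble)

# mtcars ships with R as a plain data frame
df <- mtcars
class(df)    # "data.frame"

# Convert to a tibble; row names are dropped unless moved into a column
tbl <- as_tibble(mtcars, rownames = "model")
class(tbl)   # "tbl_df" "tbl" "data.frame"
```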

`Tibble` vs. data frame

By default, a tibble printed to the console shows only the first 10 rows and the columns that fit on screen

Print the entire df and tibble to see how they are rendered differently.
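One way to compare the two renderings, sketched under the assumption that the tibble package is installed (the `model` column name is illustrative):

```r
library(tibble)

tbl <- as_tibble(mtcars, rownames = "model")

print(mtcars)        # data frame: prints all 32 rows
print(tbl)           # tibble: prints the first 10 rows plus a note about the rest
print(tbl, n = Inf)  # tibble: n = Inf prints every row
```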

The `tidyverse` packages

Functions are verbs
- select
- filter
- mutate
- group_by
- summarize
- etc.
Consistently uses '_' as a separator (e.g., group_by)
Non-standard evaluation (no need for quotes and qualifiers)
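To illustrate non-standard evaluation, here is a small side-by-side sketch of the same subset written in base R and with dplyr verbs. The result names are made up for this example.

```r
library(dplyr)

# Base R: the data frame name is repeated and column names are quoted
base_result <- mtcars[mtcars$am == 1, c("mpg", "cyl")]

# dplyr: columns are referred to by bare name inside the verbs
dplyr_result <- mtcars %>%
  filter(am == 1) %>%
  select(mpg, cyl)
```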

Base R

Slicing the built-in mtcars data set.

R’s indexing operators

The basic indexing operators [], [[]], and $ are useful for simple tasks but may return unexpected results and quickly become hard to read.
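A few base R calls that show where the surprises can come from. This is a sketch on mtcars, not an exhaustive list:

```r
# [ with a single column name keeps the data frame
class(mtcars["mpg"])                  # "data.frame"

# [ with a row and column index drops to a vector when one column is selected
class(mtcars[, "mpg"])                # "numeric"
class(mtcars[, "mpg", drop = FALSE])  # "data.frame"

# [[ extracts a single column as a vector
class(mtcars[["mpg"]])                # "numeric"

# $ on a data frame does partial name matching: "mp" resolves to "mpg"
head(mtcars$mp, 3)
```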

Using `dplyr` functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With improved code formatting:

head(
  select(
    filter(
      mtcars,
      am == 1
    ),
    mpg,
    cyl,
    gear
  ),
  n = 5
)

Why is this still hard to read?

Using `dplyr` functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With local variables to improve readability:
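A minimal sketch of what a local-variable version could look like. The variable names are illustrative, not taken from the course material.

```r
library(dplyr)

# Each step is stored under its own name
manual_cars   <- filter(mtcars, am == 1)
manual_cols   <- select(manual_cars, mpg, cyl, gear)
result_locals <- head(manual_cols, n = 5)
```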

Local variables improve readability, but this approach is prone to errors.

Using `dplyr` functions and pipes

Explicit indexing operations with pipes:
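The same query as a pipeline might look like this. It is sketched with the `%>%` pipe that dplyr re-exports, and the native `|>` pipe would work the same way here.

```r
library(dplyr)

# Read top to bottom: start with mtcars, keep manual cars,
# keep three columns, then take the first five rows
result_piped <- mtcars %>%
  filter(am == 1) %>%
  select(mpg, cyl, gear) %>%
  head(n = 5)
```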

Optimize for readability: code is written once but read many times!

`select`

Select specific columns (include):
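A short sketch of including columns by name, by range, and with a selection helper, using mtcars as a stand-in data set:

```r
library(dplyr)

by_name   <- mtcars %>% select(mpg, cyl, gear)
by_range  <- mtcars %>% select(mpg:disp)         # mpg, cyl, disp
by_helper <- mtcars %>% select(starts_with("d")) # disp, drat
```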

`select`

Select specific columns (exclude):
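Excluding columns can be sketched with a leading minus, again on mtcars:

```r
library(dplyr)

# Drop columns one by one ...
without_two <- mtcars %>% select(-mpg, -cyl)

# ... or as a group
without_two_c <- mtcars %>% select(-c(mpg, cyl))
```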

`filter`

Filter rows:
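A sketch of row filters on mtcars. Conditions separated by commas are combined with AND, and `|` expresses OR.

```r
library(dplyr)

manual           <- mtcars %>% filter(am == 1)
manual_efficient <- mtcars %>% filter(am == 1, mpg > 25)    # both must hold
four_or_fast     <- mtcars %>% filter(cyl == 4 | gear == 5) # either may hold
```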

`filter`

Filter rows with %in%:
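A sketch of matching against a set of values with `%in%`, using the cyl column of mtcars:

```r
library(dplyr)

# Keep rows whose cyl value is in the given set
four_or_six <- mtcars %>% filter(cyl %in% c(4, 6))

# Negate the whole test to keep everything else
others <- mtcars %>% filter(!cyl %in% c(4, 6))
```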

`mutate`

Add a column:
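A sketch of adding a derived column. The mtcars documentation gives wt in units of 1000 lbs, and the new column name `wt_lb` is made up for this example.

```r
library(dplyr)

# wt is stored in 1000 lbs; add the weight in lbs as a new column
with_lb <- mtcars %>%
  mutate(wt_lb = wt * 1000)
```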

`mutate`

Change a column type:
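Assigning to an existing column name inside mutate overwrites that column, which is one way to change its type. A sketch on mtcars:

```r
library(dplyr)

typed <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),  # numeric -> factor with levels "4", "6", "8"
    am  = as.logical(am)   # 0/1 -> FALSE/TRUE
  )
```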

`mutate` and `str_replace`

Change column values (replace with regex):

We use the stringr package here

See str_replace, str_replace_all, and str_replace_na.
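A sketch of how the three functions differ, followed by str_replace inside mutate. The `cars` tibble and its `model` column are built here from the mtcars row names purely for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

str_replace("a-b-c", "-", "_")      # "a_b-c"  (first match only)
str_replace_all("a-b-c", "-", "_")  # "a_b_c"  (every match)
str_replace_na(c("a", NA))          # "a" "NA" (NA becomes the string "NA")

cars <- as_tibble(mtcars, rownames = "model") %>% select(model, mpg)

# "^Merc " is a regex: "Merc " anchored at the start of the string
renamed <- cars %>%
  mutate(model = str_replace(model, "^Merc ", "Mercedes "))
```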

`mutate` and `str_replace_all`

Change column values (replace with regex):

We use the stringr package here

Use c("<pattern>"="<new value>", ...) in str_replace_all.
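A sketch of the named-vector form, where names are patterns and values are replacements. The `cars` tibble and the replacement strings are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model") %>% select(model, mpg)

# Several pattern = replacement pairs in one call
renamed_all <- cars %>%
  mutate(model = str_replace_all(
    model,
    c("^Merc " = "Mercedes ", "^Hornet " = "AMC Hornet ")
  ))
```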

`rename_all`
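A sketch of renaming every column with a function. `rename_all` still works but is superseded in current dplyr, and `rename_with` is the newer equivalent.

```r
library(dplyr)

# Apply toupper to every column name
upper_all  <- mtcars %>% rename_all(toupper)

# Same result with the newer verb
upper_with <- mtcars %>% rename_with(toupper)
```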

`group_by`

A grouped tibble affects downstream operations

Operations are applied to each group.
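A sketch of how grouping changes a later mutate, using mtcars. The column name `mpg_vs_group` is made up for this example.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

is_grouped_df(by_cyl)  # TRUE
group_vars(by_cyl)     # "cyl"

# mean(mpg) is now computed within each cyl group, not over all rows
centered <- by_cyl %>%
  mutate(mpg_vs_group = mpg - mean(mpg))
```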

`group_by` and `n`
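A sketch of counting rows per group with `n()`, using mtcars as a stand-in data set (the column name `n_cars` is illustrative):

```r
library(dplyr)

# One output row per cyl value, with the number of cars in that group
counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n())
# cyl 4 -> 11, cyl 6 -> 7, cyl 8 -> 14
```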

`group_by`, `n`, and `ungroup`
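A sketch of attaching the group size to every row and then removing the grouping so that later steps operate on the whole table again (mtcars and `n_cars` are illustrative):

```r
library(dplyr)

with_counts <- mtcars %>%
  group_by(cyl) %>%
  mutate(n_cars = n()) %>%  # group size repeated on each row of the group
  ungroup()                 # drop the grouping for downstream operations
```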

`summarize`

Handling of NA values

By default, the mean, median, etc. of a vector that includes NA values is NA.

Use na.rm=TRUE (or the shorthand na.rm=T) to drop NA values before the computation.
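A sketch of the NA behavior, first on a plain vector and then inside summarize. The NA is injected into a copy of mtcars purely for illustration.

```r
library(dplyr)

x <- c(1, 2, NA, 4)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2.333333

# Copy mtcars and blank out one mpg value (first row, a 6-cylinder car)
cars_na <- mtcars
cars_na$mpg[1] <- NA

mpg_summary <- cars_na %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg       = mean(mpg),               # NA for the cyl == 6 group
    mean_mpg_no_na = mean(mpg, na.rm = TRUE)  # computed from the remaining values
  )
```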

`arrange`

Order data by column:

By default, arrange sorts in ascending order.

Use arrange(desc(n_flights)) to order in descending order.
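A sketch of both sort directions. The data set behind n_flights is not shown here, so this example uses per-cylinder counts from mtcars with an illustrative column named `n_cars` instead.

```r
library(dplyr)

counts <- mtcars %>% count(cyl, name = "n_cars")

ascending  <- counts %>% arrange(n_cars)        # 7, 11, 14
descending <- counts %>% arrange(desc(n_cars))  # 14, 11, 7
```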

`left_join`
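A minimal sketch of a left join, assuming dplyr and tibble are installed. The lookup table `cyl_labels` and its values are hypothetical and exist only for this example. Every row of the left table is kept, and rows without a match get NA in the joined columns.

```r
library(dplyr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model") %>% select(model, cyl)

# Hypothetical lookup table; there is deliberately no entry for cyl == 8
cyl_labels <- tibble(
  cyl    = c(4, 6),
  engine = c("four-cylinder", "six-cylinder")
)

joined <- cars %>% left_join(cyl_labels, by = "cyl")
```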