CSE 599 Demos – CSE 599K Empirical Research Methods

CSE 599K
Empirical Research Methods

Winter 2025

Data wrangling

`Tibble` vs. data frame

Better performance (column layout)
More “type safety” and less magic
Better support for common interactive tasks

mtcars is a built-in data set in R
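
As a minimal sketch of what "less magic" can mean in practice, the snippet below compares partial name matching and single-column subsetting on a data frame and a tibble. The variable names (df, tb, and so on) are illustrative assumptions, not taken from the slides.

```r
library(tibble)

df <- mtcars              # base data frame
tb <- as_tibble(mtcars)   # same data as a tibble

# `$` on a data frame partially matches column names: "mp" silently finds "mpg".
df_partial <- df$mp
# A tibble never partially matches; it returns NULL and warns instead.
tb_partial <- suppressWarnings(tb$mp)

# `[` with a single column: a data frame drops to a vector, a tibble stays a tibble.
df_col <- df[, "mpg"]
tb_col <- tb[, "mpg"]
```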

`Tibble` vs. data frame

Printing a tibble to the console shows only the first 10 rows by default (and only the columns that fit on screen).

Print the entire df and tibble to see how they are rendered differently.
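
One way to try this, sketched with mtcars; the rownames = "model" argument is an assumption made here so that the car names survive the conversion to a tibble.

```r
library(tibble)

df <- mtcars                                  # plain data frame, 32 rows
tb <- as_tibble(mtcars, rownames = "model")   # row names become a "model" column

print(df)           # a data frame prints every row
print(tb)           # a tibble prints the first 10 rows and summarizes the rest
print(tb, n = Inf)  # n = Inf asks for every row of the tibble
```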

The `tidyverse` packages

Functions are verbs
- select
- filter
- mutate
- group_by
- summarize
- etc.
Consistently uses ’_’ as separator (e.g., group_by)
Non-standard evaluation (column names can be written without quotes or a df$ qualifier)

Base R: subsetting/slicing data

Slicing the built-in mtcars data set.
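
A few typical base R slicing calls on mtcars, as a sketch; the result names are illustrative assumptions.

```r
first_rows <- mtcars[1:5, ]                  # first five rows, all columns
two_cols   <- mtcars[, c("mpg", "cyl")]      # all rows, two columns by name
# Rows where am == 1 (manual transmission), restricted to three columns.
manual     <- mtcars[mtcars$am == 1, c("mpg", "cyl", "gear")]
mpg_dollar <- mtcars$mpg                     # one column as a vector via $
mpg_double <- mtcars[["mpg"]]                # the same column via [[ ]]
```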

Base R: subsetting/slicing data

R’s indexing operators

Pros:

The basic indexing operators [], [[]], and $ are concise.

Cons:

They quickly become hard to read in complex data wrangling pipelines. Intent is implicit.
May return unexpected results (e.g., selecting a single column of a data frame with df[, j] returns a one-dimensional vector by default instead of a data frame with a single column; use drop=FALSE to preserve dimensionality).
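
The dimension-dropping behavior can be checked directly. This is a small sketch on mtcars with illustrative variable names.

```r
as_vector <- mtcars[, "mpg"]                 # drops to a plain numeric vector
as_frame  <- mtcars[, "mpg", drop = FALSE]   # stays a one-column data frame

is.data.frame(as_vector)   # FALSE
is.data.frame(as_frame)    # TRUE
```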

Using `dplyr` functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With improved code formatting:

head(
  select(
    filter(
      mtcars,
      am == 1
    ),
    mpg,
    cyl,
    gear
  ),
  n = 5
)

Why is this still hard to read?

Using `dplyr` functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With local variables to improve readability:
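
A possible version with intermediate variables is sketched below; the names manual_cars and selected are assumptions chosen for illustration.

```r
library(dplyr)

manual_cars <- filter(mtcars, am == 1)                  # step 1: keep rows
selected    <- select(manual_cars, mpg, cyl, gear)      # step 2: keep columns
result      <- head(selected, n = 5)                    # step 3: first five rows
result
```

Each step has to name the previous intermediate correctly, which is where copy-and-paste mistakes tend to creep in.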

Local variables improve readability, but this approach is prone to errors (e.g., passing the wrong intermediate variable to the next step).

Using `dplyr` functions and pipes

Explicit indexing operations with pipes:
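
A sketch of the same query written with the %>% pipe (re-exported by dplyr); the name piped is an illustrative assumption.

```r
library(dplyr)

piped <- mtcars %>%
  filter(am == 1) %>%          # keep manual-transmission cars
  select(mpg, cyl, gear) %>%   # keep three columns
  head(n = 5)                  # first five rows
piped
```

The steps now read top to bottom in the order they are applied, with no intermediate names to keep track of.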

Optimize for readability: code is written once but read many times!

`select`

Select specific columns (include):
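
For instance, assuming mtcars as the input data:

```r
library(dplyr)

included <- mtcars %>%
  select(mpg, cyl, gear)   # keep only the listed columns, in this order
head(included)
```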

`select`

Select specific columns (exclude):
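
For instance, again assuming mtcars as the input data:

```r
library(dplyr)

excluded <- mtcars %>%
  select(-mpg, -cyl)   # a leading minus drops the named columns
head(excluded)
```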

`filter`

Filter rows:
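
A small sketch on mtcars; the threshold of 25 mpg is an arbitrary value chosen for illustration.

```r
library(dplyr)

efficient_manual <- mtcars %>%
  filter(mpg > 25, am == 1)   # comma-separated conditions are combined with AND
efficient_manual
```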

`filter`

Filter rows with %in%:
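
A sketch on mtcars, keeping cars whose cylinder count is in a given set (the set c(4, 6) is an illustrative choice):

```r
library(dplyr)

four_or_six <- mtcars %>%
  filter(cyl %in% c(4, 6))   # TRUE for rows whose cyl is any of the listed values
head(four_or_six)
```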

`mutate`

Add a column:
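
A sketch on mtcars; the derived column hp_per_wt is a hypothetical name introduced here.

```r
library(dplyr)

with_ratio <- mtcars %>%
  mutate(hp_per_wt = hp / wt)   # new column computed row by row from existing ones
head(with_ratio)
```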

`mutate`

Change a column type:
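
A sketch on mtcars: assigning to an existing column name inside mutate overwrites that column, which makes it a convenient way to convert types. The choice of cyl and am is illustrative.

```r
library(dplyr)

retyped <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),   # numeric -> factor
    am  = as.logical(am)    # 0/1 -> FALSE/TRUE
  )
str(retyped[, c("cyl", "am")])
```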

`mutate` and `str_replace`

Change column values (replace with regex):

We use the stringr package here

See str_replace, str_replace_all, and str_replace_na.
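
A minimal sketch, assuming the car names of mtcars are first moved into a hypothetical model column so that there is a string column to edit:

```r
library(dplyr)
library(stringr)
library(tibble)

cars_tb <- as_tibble(mtcars, rownames = "model")

renamed <- cars_tb %>%
  # str_replace substitutes only the first regex match in each string.
  mutate(model = str_replace(model, "^Merc ", "Mercedes "))
renamed$model
```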

`mutate` and `str_replace_all`

Change column values (replace with regex):

We use the stringr package here

Use c("<pattern>"="<new value>", ...) in str_replace_all.
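
A sketch of the named-vector form, again assuming a hypothetical model column built from the mtcars row names. The patterns are applied one after another to every string.

```r
library(dplyr)
library(stringr)
library(tibble)

cars_tb <- as_tibble(mtcars, rownames = "model")

cleaned <- cars_tb %>%
  mutate(model = str_replace_all(
    model,
    c("^Merc " = "Mercedes ",   # expand the abbreviation first
      "\\s+"   = "_")           # then replace every run of whitespace
  ))
cleaned$model
```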

`rename_all`
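
A sketch of rename_all applied to mtcars with toupper as the renaming function. In current dplyr releases rename_all is superseded by rename_with, which is shown alongside for comparison.

```r
library(dplyr)

upper_all  <- mtcars %>% rename_all(toupper)    # apply toupper to every column name
upper_with <- mtcars %>% rename_with(toupper)   # equivalent, newer spelling
names(upper_all)
```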

`group_by`

A grouped tibble affects downstream operations

Operations are applied to each group.
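
To illustrate, the sketch below centers mpg within each cylinder group of mtcars; because the tibble is grouped, mean(mpg) is computed per group rather than over all rows. The column name mpg_centered is a hypothetical choice.

```r
library(dplyr)

centered <- mtcars %>%
  group_by(cyl) %>%                           # one group per cylinder count
  mutate(mpg_centered = mpg - mean(mpg))      # mean(mpg) is the group mean here
centered
```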

`group_by` and `n`
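
A sketch counting rows per group with n(), using mtcars; the column name n_cars is an illustrative assumption.

```r
library(dplyr)

counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n())   # n() returns the number of rows in the current group
counts
```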

`group_by`, `n`, and `ungroup`
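
A sketch that attaches the group size to every row and then removes the grouping so later steps operate on the whole table again (n_cars is an illustrative column name):

```r
library(dplyr)

with_counts <- mtcars %>%
  group_by(cyl) %>%
  mutate(n_cars = n()) %>%   # group size repeated on each row of the group
  ungroup()                  # drop the grouping before further processing
with_counts
```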

`summarize`

Handling of NA values

By default, the mean, median, etc. of a vector that includes NA values is NA.

Use na.rm=TRUE to drop NA values before computing the statistic (na.rm=T also works, but T can be reassigned, so TRUE is safer).
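
A small sketch of the NA behavior inside summarize, using a made-up tibble (scores, with hypothetical columns g and x) so that the effect is easy to see:

```r
library(dplyr)
library(tibble)

x <- c(10, 20, NA, 30)
mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # 20, computed from the three known values

scores <- tibble(g = c("a", "a", "b", "b"), x = c(1, NA, 3, 5))
summary_tb <- scores %>%
  group_by(g) %>%
  summarize(
    mean_raw   = mean(x),                # NA for any group containing an NA
    mean_clean = mean(x, na.rm = TRUE)   # NA values dropped first
  )
summary_tb
```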

`arrange`

Order data by column:

arrange sorts in ascending order by default.

Use arrange(desc(n_flights)) to order in descending order.
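
A sketch of both orderings on mtcars. The mpg column stands in here for the n_flights column mentioned above, which belongs to a data set not shown in this section.

```r
library(dplyr)

ascending  <- mtcars %>% arrange(mpg)         # lowest mpg first (default)
descending <- mtcars %>% arrange(desc(mpg))   # highest mpg first
head(descending)
```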

`left_join`
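
A minimal sketch of a left join on two made-up tibbles (cars_small and cyl_labels, both hypothetical): every row of the left table is kept, and columns from the right table are filled with NA where no match exists.

```r
library(dplyr)
library(tibble)

cars_small <- tibble(model = c("A", "B", "C"), cyl = c(4, 6, 8))
cyl_labels <- tibble(cyl = c(4, 6), label = c("small", "medium"))

joined <- left_join(cars_small, cyl_labels, by = "cyl")
joined   # model "C" has no matching label, so its label is NA
```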