CSE 599K
Empirical Research Methods

Winter 2025

Data wrangling

Tibble vs. data frame

  • Better performance (column layout)
  • More “type safety” and less magic
  • Better support for common interactive tasks

mtcars is a built-in data set in R

Tibble vs. data frame

By default, a large tibble printed to the console shows only its first 10 rows (and only the columns that fit on screen).

Print the entire df and tibble to see how they are rendered differently.
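A minimal sketch of the comparison, assuming the tibble package is installed. The column name `model` used to keep the row names is an arbitrary choice for illustration.

```r
library(tibble)

df <- mtcars                                  # base data frame with 32 rows
tb <- as_tibble(mtcars, rownames = "model")   # tibble; row names kept as a column

print(df)            # renders all 32 rows
print(tb)            # renders the first 10 rows plus the column types
print(tb, n = Inf)   # renders every row of the tibble
```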

The tidyverse packages

  • Functions are verbs
    • select
    • filter
    • mutate
    • group_by
    • summarize
    • etc.
  • Consistently use '_' as the separator (e.g., group_by)
  • Non-standard evaluation (column names need neither quotes nor a df$ qualifier)

Base R: subsetting/slicing data

Slicing the built-in mtcars data set.
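A few base R slicing operations on mtcars, as a sketch of what such slicing typically looks like. The variable name `manual` is illustrative only.

```r
# Rows with am == 1 (manual transmission), restricted to three columns
manual <- mtcars[mtcars$am == 1, c("mpg", "cyl", "gear")]
head(manual, n = 5)

mtcars[1:3, ]      # first three rows, all columns
mtcars[["mpg"]]    # a single column as a numeric vector
mtcars$mpg         # the same column via $
```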

Base R: subsetting/slicing data

R’s indexing operators

Pros:

  • The basic indexing operators [], [[]], and $ are concise.

Cons:

  • They quickly become hard to read in complex data wrangling pipelines. Intent is implicit.

  • May return unexpected results (e.g., selecting a single column of a data frame with [ returns a 1-dimensional vector by default instead of a 2-dimensional data frame with a single column; use drop=FALSE to preserve dimensionality).
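The dimension-dropping behavior can be checked directly. This sketch assumes the tibble package is installed, to show that a tibble keeps its shape under the same operation.

```r
v  <- mtcars[, "mpg"]                      # numeric vector: the data frame is dropped
d  <- mtcars[, "mpg", drop = FALSE]        # data frame with a single column
tb <- tibble::as_tibble(mtcars)[, "mpg"]   # tibble: stays a one-column tibble

is.data.frame(v)    # FALSE
is.data.frame(d)    # TRUE
is.data.frame(tb)   # TRUE
```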

Using dplyr functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With improved code formatting:

head(
  select(
    filter(
      mtcars,
      am==1
    ),
    mpg,
    cyl,
    gear
  ),
  n=5
)

Why is this still hard to read?

Using dplyr functions

Explicit indexing operations:

head(select(filter(mtcars, am==1), mpg, cyl, gear), n=5)

With local variables to improve readability:
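One possible version with local variables, assuming dplyr is loaded. The names `manual`, `selected`, and `result` are hypothetical.

```r
library(dplyr)

manual   <- filter(mtcars, am == 1)          # step 1: keep manual-transmission cars
selected <- select(manual, mpg, cyl, gear)   # step 2: keep three columns
result   <- head(selected, n = 5)            # step 3: first five rows
result
```

Each step must be passed the right intermediate variable; passing `mtcars` instead of `manual` in step 2, for example, runs without error and silently skips the filter.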

Local variables improve readability, but this approach is prone to errors.

Using dplyr functions and pipes

Explicit indexing operations with pipes:
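A sketch of the same query as a pipeline, assuming the magrittr pipe `%>%` that dplyr re-exports. The native pipe `|>` (R 4.1 and later) works the same way here.

```r
library(dplyr)

result <- mtcars %>%
  filter(am == 1) %>%          # keep manual-transmission cars
  select(mpg, cyl, gear) %>%   # keep three columns
  head(n = 5)                  # first five rows
result
```

The steps now read top to bottom in the order they are applied, and no intermediate names are needed.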

Optimize for readability: code is written once but read many times!

select

Select specific columns (include):
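A minimal sketch on mtcars, assuming dplyr is loaded. The name `cars` is illustrative.

```r
library(dplyr)

cars <- mtcars %>% select(mpg, cyl, gear)   # keep only the listed columns
names(cars)
```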

select

Select specific columns (exclude):
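A minimal sketch on mtcars, assuming dplyr is loaded. A leading minus sign drops a column.

```r
library(dplyr)

cars <- mtcars %>% select(-mpg, -cyl)   # keep everything except mpg and cyl
names(cars)
```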

filter

Filter rows:
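A minimal sketch on mtcars, assuming dplyr is loaded. The threshold of 25 mpg is an arbitrary choice for illustration. Conditions separated by commas are combined with a logical AND.

```r
library(dplyr)

efficient_manual <- mtcars %>%
  filter(am == 1, mpg > 25)   # manual transmission AND more than 25 mpg
efficient_manual
```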

filter

Filter rows with %in%:
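A minimal sketch on mtcars, assuming dplyr is loaded. `%in%` keeps the rows whose value matches any element of the given vector.

```r
library(dplyr)

four_or_six <- mtcars %>%
  filter(cyl %in% c(4, 6))   # cars with 4 or 6 cylinders
nrow(four_or_six)
```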

mutate

Add a column:
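A minimal sketch on mtcars, assuming dplyr is loaded. The derived column `hp_per_wt` is a hypothetical example, not a column from the original slides.

```r
library(dplyr)

cars <- mtcars %>%
  mutate(hp_per_wt = hp / wt)   # new column computed from existing columns
head(cars)
```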

mutate

Change a column type:
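A minimal sketch on mtcars, assuming dplyr is loaded. Assigning to an existing column name inside mutate overwrites that column; the choice of `cyl` and `am` is for illustration.

```r
library(dplyr)

cars <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),   # numeric -> factor
    am  = as.logical(am)    # 0/1 -> FALSE/TRUE
  )
str(cars[, c("cyl", "am")])
```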

mutate and str_replace

Change column values (replace with regex):

We use the stringr package here

See str_replace, str_replace_all, and str_replace_na.
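A minimal sketch, assuming dplyr, stringr, and tibble are installed. The column name `model` and the replacement text are illustrative choices. Note that str_replace replaces only the first match in each string.

```r
library(dplyr)
library(stringr)
library(tibble)

cars <- mtcars %>%
  rownames_to_column("model") %>%                             # row names -> column
  mutate(model = str_replace(model, "^Merc ", "Mercedes "))   # regex anchored at start
cars$model
```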

mutate and str_replace_all

Change column values (replace with regex):

We use the stringr package here

Use c("<pattern>"="<new value>", ...) in str_replace_all.
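A minimal sketch of the named-vector form, assuming dplyr, stringr, and tibble are installed. The column name `model` and both replacements are illustrative choices.

```r
library(dplyr)
library(stringr)
library(tibble)

replacements <- c("^Merc " = "Mercedes ", "^Hornet " = "AMC Hornet ")

cars <- mtcars %>%
  rownames_to_column("model") %>%
  mutate(model = str_replace_all(model, replacements))   # each pattern applied in turn
cars$model
```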

rename_all
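A minimal sketch on mtcars, assuming dplyr is loaded. rename_all applies one function to every column name; in current dplyr releases it is superseded by rename_with, which gives the same result here.

```r
library(dplyr)

cars  <- mtcars %>% rename_all(toupper)    # MPG, CYL, DISP, ...
cars2 <- mtcars %>% rename_with(toupper)   # current equivalent
names(cars)
```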

group_by

A grouped tibble affects downstream operations

Operations are applied to each group.
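A sketch of how grouping changes later verbs, assuming dplyr is loaded. The names `by_cyl`, `first_per_group`, and `mpg_centered` are illustrative.

```r
library(dplyr)

by_cyl <- as_tibble(mtcars) %>% group_by(cyl)
by_cyl                                   # header shows the grouping: cyl [3]

# slice_head and mean() now operate within each cyl group
first_per_group <- by_cyl %>% slice_head(n = 1)
centered <- by_cyl %>% mutate(mpg_centered = mpg - mean(mpg))
```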

group_by and n
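A minimal sketch on mtcars, assuming dplyr is loaded. n() returns the number of rows in the current group; the column name `n_cars` is illustrative.

```r
library(dplyr)

counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n())   # one row per cyl value with its row count
counts
```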

group_by, n, and ungroup
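A sketch that attaches the group size to every row and then removes the grouping, assuming dplyr is loaded. Without ungroup(), later verbs would keep operating per group.

```r
library(dplyr)

cars <- mtcars %>%
  group_by(cyl) %>%
  mutate(n_cars = n()) %>%   # group size repeated on each row of the group
  ungroup()                  # later operations see the whole table again
head(cars)
```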

summarize
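A minimal sketch on mtcars, assuming dplyr is loaded. summarize collapses each group to a single row; the output column names are illustrative.

```r
library(dplyr)

stats <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    max_hp   = max(hp),
    .groups  = "drop"        # return an ungrouped tibble
  )
stats
```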

Handling of NA values

By default, the mean, median, etc. of a vector that includes NA values is NA.

Use na.rm=TRUE to drop NA values before computing (na.rm=T also works, but T is an ordinary variable that can be overwritten).
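A small base R illustration with a made-up vector:

```r
x <- c(1, 2, NA, 4)

mean(x)                    # NA
mean(x, na.rm = TRUE)      # 2.333333
median(x, na.rm = TRUE)    # 2
```

The same argument works inside summarize, e.g. summarize(m = mean(col, na.rm = TRUE)) for a hypothetical column `col`.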

arrange

Order data by column:

arrange orders in ascending order by default

Use arrange(desc(n_flights)) to order in descending order.
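A minimal sketch, assuming dplyr is loaded. The table `flights_per_carrier` and its values are made up for illustration; only the column name `n_flights` comes from the text above.

```r
library(dplyr)

# Hypothetical per-carrier counts (toy data)
flights_per_carrier <- tibble(
  carrier   = c("A", "B", "C"),
  n_flights = c(120L, 340L, 215L)
)

ascending  <- flights_per_carrier %>% arrange(n_flights)         # A, C, B
descending <- flights_per_carrier %>% arrange(desc(n_flights))   # B, C, A
```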

left_join
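A minimal sketch of a left join, assuming dplyr is loaded. Both tables and all their values are made up for illustration. left_join keeps every row of the left table and fills the columns of unmatched rows with NA.

```r
library(dplyr)

# Hypothetical toy tables
flights_per_carrier <- tibble(
  carrier   = c("A", "B", "Z"),
  n_flights = c(120L, 340L, 5L)
)
carriers <- tibble(
  carrier = c("A", "B"),
  name    = c("Airline A", "Airline B")
)

joined <- left_join(flights_per_carrier, carriers, by = "carrier")
joined   # carrier "Z" has no match, so its name is NA
```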