Fall 2024
Data wrangling
Tibble vs. data frame
mtcars is a built-in data set in R.
Tibble vs. data frame
Tibbles printed to the console show only the first 10 rows.
Print the entire df and tibble to see how they are rendered differently.
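A minimal sketch of the comparison, assuming the tibble package is installed. The names df, tbl, and the model column are chosen here for illustration.

```r
library(tibble)

df  <- mtcars                                  # plain data frame, 32 rows
tbl <- as_tibble(mtcars, rownames = "model")   # tibble; row names become a column

print(df)            # a data frame prints all 32 rows
print(tbl)           # a tibble prints the first 10 rows and summarizes the rest
print(tbl, n = Inf)  # n = Inf asks the tibble to print every row
```

Passing n = Inf is one way to see the whole tibble without converting it back to a data frame.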
tidyverse packages
select
filter
mutate
group_by
summarize (with group_by)
Slicing the built-in mtcars data set.
R’s indexing operators
The basic indexing operators [], [[]], and $ are useful for simple tasks but may return unexpected results and quickly become hard to read.
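Two examples of results that can surprise, sketched with base R only. The variable names are illustrative.

```r
two_cols <- mtcars[, c("mpg", "cyl")]       # two columns: still a data frame
one_col  <- mtcars[, "mpg"]                 # one column: silently becomes a numeric vector
kept     <- mtcars[, "mpg", drop = FALSE]   # drop = FALSE keeps a one-column data frame
as_vec   <- mtcars[["mpg"]]                 # [[ ]] extracts the column itself
partial  <- mtcars$mp                       # $ partially matches the name "mpg"

class(two_cols)
class(one_col)
class(kept)
```

The same [ ] expression returns a different type depending on how many columns are selected, and $ accepts an abbreviated column name without complaint.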
dplyr functions
Explicit indexing operations:
With local variables to improve readability:
Local variables improve readability, but this approach is prone to errors.
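The original slide code is not available, so the following is a sketch of the two styles under an assumed task: keep the four-cylinder cars with mpg above 30 and three of their columns. All variable names are illustrative.

```r
# Explicit indexing in a single expression
result_inline <- mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "cyl", "wt")]

# The same operation with local variables
is_four_cyl  <- mtcars$cyl == 4
is_efficient <- mtcars$mpg > 30
keep_cols    <- c("mpg", "cyl", "wt")
result_named <- mtcars[is_four_cyl & is_efficient, keep_cols]
```

One way the second form can go wrong: a mask such as is_four_cyl is computed once, so it no longer lines up with the rows if the data frame is modified or replaced before the mask is used.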
dplyr functions and pipes
Explicit indexing operations with pipes:
Optimize for readability: code is written once but read many times!
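A sketch of a piped version, assuming dplyr is installed and using an illustrative task (four-cylinder cars with mpg above 30, three columns kept):

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl == 4, mpg > 30) %>%   # keep rows that satisfy both conditions
  select(mpg, cyl, wt)             # then keep three columns
```

Each step reads top to bottom as a verb applied to the result of the previous step, with no intermediate variables to keep in sync.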
select
Select specific columns (include):
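A short sketch using mtcars; the result names are illustrative.

```r
library(dplyr)

by_name  <- mtcars %>% select(mpg, cyl, wt)   # list the columns to keep
by_range <- mtcars %>% select(mpg:disp)       # keep a contiguous range of columns
```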
select
Select specific columns (exclude):
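A short sketch using mtcars; a minus sign in front of a column (or a vector of columns) drops it and keeps everything else.

```r
library(dplyr)

without_am    <- mtcars %>% select(-am)          # drop one column
without_vs_am <- mtcars %>% select(-c(vs, am))   # drop several columns
```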
filter
Filter rows:
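A short sketch using mtcars; the thresholds are arbitrary illustrations.

```r
library(dplyr)

four_cyl <- mtcars %>% filter(cyl == 4)           # one condition
heavy_v8 <- mtcars %>% filter(cyl == 8, wt > 5)   # comma-separated conditions are combined with AND
```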
filter
Filter rows with %in%:
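A short sketch using mtcars: %in% tests membership in a set of values, which avoids chaining several == comparisons with |.

```r
library(dplyr)

four_or_six <- mtcars %>% filter(cyl %in% c(4, 6))     # cyl is 4 or 6
the_rest    <- mtcars %>% filter(!cyl %in% c(4, 6))    # negate to keep everything else
```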
mutate
Add a column:
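A short sketch using mtcars. The new column kpl (kilometers per liter) is an illustrative choice; 1 mile per US gallon is about 0.425144 km per liter.

```r
library(dplyr)

with_kpl <- mtcars %>%
  mutate(kpl = mpg * 0.425144)   # new column computed from an existing one
```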
mutate
Change a column type:
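A short sketch using mtcars: assigning to an existing column name inside mutate overwrites that column, here with a converted type.

```r
library(dplyr)

typed <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),   # numeric -> factor with levels "4", "6", "8"
    am  = as.logical(am)    # 0/1 -> FALSE/TRUE
  )
```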
mutate and str_replace
Change column values (replace with regex):
We use the stringr package here.
See str_replace, str_replace_all, and str_replace_na.
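A sketch of a regex replacement inside mutate, assuming the car names from mtcars are first moved into a column. The column name model and the replacement text are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model")

renamed <- cars %>%
  mutate(model = str_replace(model, "^Merc", "Mercedes"))   # ^ anchors the match at the start
```

str_replace changes only the first match in each string; str_replace_all changes every match.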
mutate and str_replace_all
Change column values (replace with regex):
We use the stringr package here.
Use c("<pattern>"="<new value>", ...) in str_replace_all.
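A sketch of several replacements in one call, using a named vector of pattern = replacement pairs. The column name model and both patterns are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model")

cleaned <- cars %>%
  mutate(model = str_replace_all(
    model,
    c("^Merc" = "Mercedes",   # expand the abbreviation at the start of the name
      "\\s+"  = "_")          # replace each run of whitespace with an underscore
  ))
```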
rename_all
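A sketch that applies one function to every column name. The choice of toupper is illustrative. In dplyr 1.0.0 and later, rename_all is superseded by rename_with, which gives the same result here.

```r
library(dplyr)

upper_all  <- mtcars %>% rename_all(toupper)    # apply toupper to every column name
upper_with <- mtcars %>% rename_with(toupper)   # current equivalent
```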
group_by
A grouped tibble affects downstream operations.
Operations are applied to each group separately.
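A sketch of how grouping changes a later step, using mtcars. The column mpg_rank is an illustrative name.

```r
library(dplyr)
library(tibble)

by_cyl <- as_tibble(mtcars) %>% group_by(cyl)

group_vars(by_cyl)   # the grouping is stored on the tibble itself
n_groups(by_cyl)     # one group per distinct value of cyl

ranked <- by_cyl %>%
  mutate(mpg_rank = min_rank(desc(mpg)))   # ranks restart within each cyl group
```

Without group_by, the same mutate would rank all 32 cars together.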
group_by and n
group_by, n, and ungroup
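A sketch of both patterns using mtcars. The column names n_cars, n_in_group, and share are illustrative.

```r
library(dplyr)

# group_by and n: count the rows in each group
counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n())

# group_by, n, and ungroup: add the group size, then return to whole-table operations
shares <- mtcars %>%
  group_by(cyl) %>%
  mutate(n_in_group = n()) %>%         # n() is the size of the current group
  ungroup() %>%
  mutate(share = n_in_group / n())     # after ungroup, n() is the total row count
```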
summarize
Handling of NA values
The mean, median, etc. of a vector that includes NA values is NA by default.
Use na.rm = TRUE (or the shorthand na.rm = T) to drop NA values before computing.
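A sketch with a small made-up tibble (scores, group, and value are hypothetical names and data):

```r
library(dplyr)
library(tibble)

scores <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, NA, 3, 5)
)

mean(scores$value)                 # NA, because one value is missing
mean(scores$value, na.rm = TRUE)   # 3, the mean of 1, 3, and 5

by_group <- scores %>%
  group_by(group) %>%
  summarize(
    mean_value = mean(value, na.rm = TRUE),   # ignore missing values within each group
    n_missing  = sum(is.na(value))            # report how many were dropped
  )
```

Reporting the number of missing values alongside the summary makes it visible how much data each statistic is based on.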
arrange
Order data by column:
arrange orders in ascending order by default.
Use arrange(desc(n_flights)) to order in descending order.
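A sketch with a small made-up tibble. The data set behind n_flights is not shown in this section, so the carriers and counts below are hypothetical.

```r
library(dplyr)
library(tibble)

flights_per_carrier <- tibble(
  carrier   = c("AA", "DL", "UA"),
  n_flights = c(120, 340, 95)    # hypothetical counts
)

ascending  <- flights_per_carrier %>% arrange(n_flights)         # smallest first
descending <- flights_per_carrier %>% arrange(desc(n_flights))   # largest first
```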
left_join
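A sketch of a left join on two small made-up tibbles. The tables, the key column carrier, and all values are hypothetical.

```r
library(dplyr)
library(tibble)

flights <- tibble(
  carrier   = c("AA", "DL", "ZZ"),
  n_flights = c(120, 340, 5)
)

airlines <- tibble(
  carrier = c("AA", "DL", "UA"),
  name    = c("American Airlines", "Delta Air Lines", "United Air Lines")
)

joined <- flights %>%
  left_join(airlines, by = "carrier")   # keep every row of flights, add matching columns
```

Every row of the left table is kept: the carrier "ZZ" has no match and gets NA for name, while "UA" appears only in the right table and is not added.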