In-class exercise Data Wrangling: Instructions

High-level goal

The high-level goal of this exercise is to gain experience with common data-wrangling operations, common data types, and robustness.

Set up

Team up in groups of size 2, and self-assign to a group (In-class-2-data-wrangling groupset) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups found a pair-programming set up (in person or using screen-sharing for remote work) to be beneficial.
Set up R: Any one of the following options is sufficient for this exercise:
1. Local: Install R on your machine.
2. Local: Install RStudio on your machine.
3. Web-based: Use a computational notebook such as Jupyter or Google Colab.
4. Web-based: Use the WebR REPL (convenient but still early release).
The (free) web-based options offer a very convenient way of exploring R. They are sufficient for most exercises, but you may need additional resources for some of the big-data analyses towards the end of the course.
Start an interactive R runtime (R, RStudio, or a web-based instance).
Install (if needed) and load the following packages:
- tidyverse
- nycflights23

Part 1: Data consolidation

Instructions

Recall the nycflights23 package, which provides the following data frames (tibbles):

airlines
airports
flights
planes
weather

Perform the following left join, and inspect the output message and the result:

left_join(flights, planes) %>% head(10)

What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each.

Create a consolidated data frame, by selectively joining the five data frames from the nycflights23 package:
- From flights: all columns
- From airlines: name
- From airports: name (for origin and destination)
- From planes: year (manufactured) and manufacturer
- From weather: visib
Use the appropriate _join function(s) from the dplyr package.
Export the consolidated data frame to a csv file named consolidated.csv. Use the write_csv function from the readr package.

Questions

Q1 Why was the first attempt of executing left_join(flights, planes) incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each.
Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the nycflights23 package.

Part 2: Data cleaning and reshaping

Read the exported csv (consolidated.csv). Use the read_csv function from the readr package.
For all subsequent instructions, use the imported consolidated data frame.
Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data?
Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable – a factor with the following levels:
- Boeing
- Airbus
- Embraer
- Other
(While the clean-up operation may appear complex, a combination of dplyr’s mutate function and the stringr str_replace functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!)
Compute the median arrival delay per origin and manufacturer, and sort the output by origin and median arrival delay (ignore NA values with na.rm=T).
Compute the mean arrival delay per origin and month (ignore NA values with na.rm=T).
Reshape the resulting data frame to be in wide format with the following columns:
- month
- EWR
- JFK
- LGA
(Hint: pivot_wider(names_from = origin, values_from = delay))
Add a column quarter that groups month by quarter of year (i.e., months 1–3 all belong to the 1st quarter; months 4–6 to the 2nd quarter, etc.).

(Hint: Use mutate and cut to introduce the quarter column.)
Reshape the data frame with the new quarter column (which is still in wide format) to long data format.

(Hint: pivot_longer(cols = c(EWR, JFK, LGA), names_to = "origin", values_to = "delay"))
Plot the resulting data frame.

(Hint: ggplot() + geom_boxplot(aes(x=quarter, y=delay, fill=origin)) + theme_bw())

Questions

Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues.
Q4 Provide the sorted table that gives the median arrival delay per origin and manufacturer.
Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set.
Q6 Contrast the long and wide data format: what are the pros and cons and which is preferable in what context? Give one concrete use case for each data format.
Q7 Provide the boxplot that shows the distribution of mean arrival delay per quarter and origin airport.
Q8 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).

Deliverables

A plain-text (or PDF) file with your answers to the 8 questions above. Please list all group members at the top of your submission.

Steps for turn-in

One team member should upload the deliverables to Canvas.