In-class exercise Data wrangling

Author

In [1]:

library(tidyverse)
library(nycflights23)

Part 1: Data consolidation

Recall the nycflights23 package, which provides the following data frames (tibbles):

In [2]:

left_join(flights, planes) %>% head(10)

What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each.

Export the consolidated data frame to a csv file named consolidated.csv. Use the write_csv function from the readr package.

Q1 Why was the first attempt of executing left_join(flights, planes) incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each.
Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the nycflights23 package.

Read the exported csv (consolidated.csv). Use the read_csv function from the readr package.

Use the imported consolidated data frame for all subsequent instructions.

Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data?

Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable – a factor with the following levels:
- Boeing
- Airbus
- Embraer
- Other
(While the clean-up operation may appear complex, a combination of dplyr’s mutate function with an adequate stringr functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!)
Compute the mean arrival delay per origin and month and sort the output by origin and mean arrival delay (ignore NA values with na.rm=T).

Add a column quarter that groups month by quarter of year (i.e., months 1–3 all belong to the 1st quarter; months 4–6 to the 2nd quarter, etc.).

(Hint: Use lubridate’s quarter function or a combination of mutate and cut to introduce the quarter column.)

Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues.
Q4 Provide the sorted table that gives the mean arrival delay per origin and month.
Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set.
Q6 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).