Back to Article
Starter code: qmd notebook
Download Source

In-class exercise Data wrangling

Author

In [1]:
library(tidyverse)
library(nycflights23)

Part 1: Data consolidation

Instructions

  1. Recall the nycflights23 package, which provides the following data frames (tibbles):
  • airlines
  • airports
  • flights
  • planes
  • weather
  1. Perform the following left join, and inspect the output message and the result:
In [2]:
left_join(flights, planes) %>% head(10)

What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each.

  1. Create a consolidated data frame, by selectively joining the five data frames from the nycflights23 package:
    • From flights: all columns
    • From airlines: name
    • From airports: name (for origin and destination)
    • From planes: year (manufactured) and manufacturer
    • From weather: visib
    Use the appropriate _join function(s) from the dplyr package.
  1. Export the consolidated data frame to a csv file named consolidated.csv. Use the write_csv function from the readr package.

Questions

  • Q1 Why was the first attempt of executing left_join(flights, planes) incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each.

  • Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the nycflights23 package.

Part 2: Data cleaning and reshaping

Instructions

  1. Read the exported csv (consolidated.csv). Use the read_csv function from the readr package.

    Use the imported consolidated data frame for all subsequent instructions.

  1. Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data?
  1. Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable – a factor with the following levels:

    • Boeing
    • Airbus
    • Embraer
    • Other

    (While the clean-up operation may appear complex, a combination of dplyr’s mutate function with an adequate stringr functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!)

  2. Compute the mean arrival delay per origin and month and sort the output by origin and mean arrival delay (ignore NA values with na.rm=T).

  1. Reshape the resulting data frame to be in wide format with the following columns:
    • month
    • EWR
    • JFK
    • LGA
    (Hint: pivot_wider)
  1. Add a column quarter that groups month by quarter of year (i.e., months 1–3 all belong to the 1st quarter; months 4–6 to the 2nd quarter, etc.).

    (Hint: Use lubridate’s quarter function or a combination of mutate and cut to introduce the quarter column.)

  1. Reshape the data frame with the new quarter column (which is still in wide format) to long data format with the following columns:
    • month
    • quarter
    • origin
    • delay
    (Hint: pivot_longer)

Questions

  • Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues.

  • Q4 Provide the sorted table that gives the mean arrival delay per origin and month.

  • Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set.

  • Q6 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).