library(tidyverse)
library(nycflights23)
In-class exercise Data wrangling
In [1]:
Part 1: Data consolidation
Instructions
- Recall the nycflights23 package, which provides the following data frames (tibbles):
airlines
airports
flights
planes
weather
- Perform the following left join, and inspect the output message and the result:
In [2]:
left_join(flights, planes) %>% head(10)
What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each.
- Create a consolidated data frame, by selectively joining the five data frames from the
nycflights23
package:- From
flights
: all columns - From
airlines
: name - From
airports
: name (for origin and destination) - From
planes
: year (manufactured) and manufacturer - From
weather
: visib
_join
function(s) from thedplyr
package. - From
- Export the consolidated data frame to a csv file named consolidated.csv. Use the
write_csv
function from thereadr
package.
Questions
Q1 Why was the first attempt of executing
left_join(flights, planes)
incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each.Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the
nycflights23
package.
Part 2: Data cleaning and reshaping
Instructions
Read the exported csv (consolidated.csv). Use the
read_csv
function from thereadr
package.Use the imported consolidated data frame for all subsequent instructions.
- Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data?
Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable – a factor with the following levels:
- Boeing
- Airbus
- Embraer
- Other
(While the clean-up operation may appear complex, a combination of dplyr’s
mutate
function with an adequatestringr
functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!)Compute the mean arrival delay per origin and month and sort the output by origin and mean arrival delay (ignore NA values with
na.rm=T
).
- Reshape the resulting data frame to be in wide format with the following columns:
month
EWR
JFK
LGA
pivot_wider
)
Add a column
quarter
that groupsmonth
by quarter of year (i.e., months 1–3 all belong to the 1st quarter; months 4–6 to the 2nd quarter, etc.).(Hint: Use lubridate’s
quarter
function or a combination ofmutate
andcut
to introduce thequarter
column.)
- Reshape the data frame with the new quarter column (which is still in wide format) to long data format with the following columns:
month
quarter
origin
delay
pivot_longer
)
Questions
Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues.
Q4 Provide the sorted table that gives the mean arrival delay per origin and month.
Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set.
Q6 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).