In-class exercise Data Wrangling: Instructions
High-level goal
The high-level goal of this exercise is to gain experience with common data-wrangling operations, common data types, and robustness.
Set up
Team up in groups of size 2, and self-assign to a group (In-class-2-data-wrangling groupset) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups found a pair-programming set up (in person or using screen-sharing for remote work) to be beneficial.
Set up R: Any one of the following options is sufficient for this exercise:
- Local: Install R on your machine.
- Local: Install RStudio on your machine.
- Web-based: Use a computational notebook such as Jupyter or Google Colab.
- Web-based: Use the WebR REPL (convenient but still early release).
The (free) web-based options offer a very convenient way of exploring R. They are sufficient for most exercises, but you may need additional resources for some of the big-data analyses towards the end of the course.
Start an interactive R runtime (R, RStudio, or a web-based instance).
Install (if needed) and load the following packages:
tidyverse
nycflights23
Part 1: Data consolidation
Instructions
- Recall the nycflights23 package, which provides the following data frames (tibbles):
airlines
airports
flights
planes
weather
- Perform the following left join, and inspect the output message and the result:
left_join(flights, planes) %>% head(10)
What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each.
Create a consolidated data frame, by selectively joining the five data frames from the
nycflights23
package:- From
flights
: all columns - From
airlines
: name - From
airports
: name (for origin and destination) - From
planes
: year (manufactured) and manufacturer - From
weather
: visib
Use the appropriate
_join
function(s) from thedplyr
package.- From
Export the consolidated data frame to a csv file named consolidated.csv. Use the
write_csv
function from thereadr
package.
Questions
Q1 Why was the first attempt of executing
left_join(flights, planes)
incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each.Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the
nycflights23
package.
Part 2: Data cleaning and reshaping
Read the exported csv (consolidated.csv). Use the
read_csv
function from thereadr
package.For all subsequent instructions, use the imported consolidated data frame.
Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data?
Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable – a factor with the following levels:
- Boeing
- Airbus
- Embraer
- Other
(While the clean-up operation may appear complex, a combination of dplyr’s
mutate
function and the stringrstr_replace
functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!)Compute the median arrival delay per origin and manufacturer, and sort the output by origin and median arrival delay (ignore NA values with
na.rm=T
).Compute the mean arrival delay per origin and month (ignore NA values with
na.rm=T
).Reshape the resulting data frame to be in wide format with the following columns:
month
EWR
JFK
LGA
(Hint:
pivot_wider(names_from = origin, values_from = delay)
)Add a column
quarter
that groupsmonth
by quarter of year (i.e., months 1–3 all belong to the 1st quarter; months 4–6 to the 2nd quarter, etc.).(Hint: Use
mutate
andcut
to introduce thequarter
column.)Reshape the data frame with the new quarter column (which is still in wide format) to long data format.
(Hint:
pivot_longer(cols = c(EWR, JFK, LGA), names_to = "origin", values_to = "delay")
)Plot the resulting data frame.
(Hint:
ggplot() + geom_boxplot(aes(x=quarter, y=delay, fill=origin)) + theme_bw()
)
Questions
Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues.
Q4 Provide the sorted table that gives the median arrival delay per origin and manufacturer.
Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set.
Q6 Contrast the long and wide data format: what are the pros and cons and which is preferable in what context? Give one concrete use case for each data format.
Q7 Provide the boxplot that shows the distribution of mean arrival delay per quarter and origin airport.
Q8 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).
Deliverables
- A plain-text (or PDF) file with your answers to the 8 questions above. Please list all group members at the top of your submission.
Steps for turn-in
One team member should upload the deliverables to Canvas.