--- title: "In-class exercise Data wrangling" author: "" format: html jupyter: ir engine: knitr # Set to true to evaluate the code cells when rendering the notebook eval: false # Embed notebook view, together with a download link of the qmd notebook notebook-view: - notebook: data_wrangling.qmd title: "Data wrangling: qmd notebook" --- ```{r} library(tidyverse) library(nycflights23) ``` ## Part 1: Data consolidation ### Instructions 1. Recall the [nycflights23 package](https://cran.r-project.org/web/packages/nycflights23/nycflights23.pdf), which provides the following data frames (tibbles): - `airlines` - `airports` - `flights` - `planes` - `weather` 2. Perform the following left join, and inspect the output message and the result: ```{r} left_join(flights, planes) %>% head(10) ``` What is the problem with this join? Come up with two strategies to solve the problem and choose one strategy after considering the pros/cons of each. 3. Create a consolidated data frame, by *selectively* joining the five data frames from the `nycflights23` package: - From `flights`: all columns - From `airlines`: *name* - From `airports`: *name* (for origin and destination) - From `planes`: *year* (manufactured) and *manufacturer* - From `weather`: *visib* Use the appropriate `_join` function(s) from the `dplyr` package. ```{r} ``` 4. Export the consolidated data frame to a csv file named *consolidated.csv*. Use the `write_csv` function from the `readr` package. ```{r} ``` ### Questions * Q1 Why was the first attempt of executing `left_join(flights, planes)` incorrect? Describe the two strategies that you considered to avoid such pitfalls, contrasting the pros and cons for each. * Q2 In your own words, what are the differences between left join, inner join, and full join? Briefly describe a use case for each join, in the context of the data provided by the `nycflights23` package. ## Part 2: Data cleaning and reshaping ### Instructions 5. Read the exported csv (*consolidated.csv*). Use the `read_csv` function from the `readr` package. *Use the imported consolidated data frame for all subsequent instructions.* ```{r} ``` 7. Break down the number of flights by manufacturer (name) of the corresponding aircraft. What is the issue with the data? ```{r} ``` 8. Clean up and simplify the manufacturer data: change manufacturer name to be a categorical variable -- a factor with the following levels: - Boeing - Airbus - Embraer - Other (While the clean-up operation may appear complex, a combination of dplyr's `mutate` function with an adequate `stringr` functions will drastically simplify the code. Ask the course staff for help if you get stuck after conceptually specifying your data transformation!) 10. Compute the *mean* arrival delay per origin and month and sort the output by origin and mean arrival delay (ignore NA values with `na.rm=T`). ```{r} ``` 11. Reshape the resulting data frame to be in wide format with the following columns: - `month` - `EWR` - `JFK` - `LGA` (Hint: `pivot_wider`) ```{r} ``` 12. Add a column `quarter` that groups `month` by quarter of year (i.e., months 1--3 all belong to the 1st quarter; months 4--6 to the 2nd quarter, etc.). (Hint: Use lubridate's `quarter` function or a combination of `mutate` and `cut` to introduce the `quarter` column.) ```{r} ``` 13. Reshape the data frame with the new quarter column (which is still in wide format) to long data format with the following columns: - `month` - `quarter` - `origin` - `delay` (Hint: `pivot_longer`) ```{r} ``` ### Questions * Q3 What issue(s) did you observe with the manufacturer names? Describe two possible strategies for building a robust data analysis pipeline that accounts for these types of data issues. * Q4 Provide the sorted table that gives the mean arrival delay per origin and month. * Q5 For convenience, the instructions above simply removed NA values when computing aggregated statistics. Describe two other possible strategies for cleaning up NA values in a data set. * Q6 (Optional) Identify the top-5 carriers by flight volume across all airports. Then filter the data for these 5 carriers and break down the mean arrival delay by carrier, origin airport, and season (Spring, Summer, Fall, Winter).