HW2: Refined end-to-end data pipeline and analysis

High-level goal

The high-level goal of this exercise is two-fold: (1) revisit and improve the end-to-end data analysis developed in the previous homework and (2) gain experience with distributed computing.

Setup

  • This is an individual exercise!

  • You can complete this exercise on your own machine or on attu.

  • This exercise involves four main steps:

    1. Addressing HW1 feedback
    2. Adding effective ggplot2 visualizations
    3. Incorporating distributed computing with Spark
    4. Building Quarto reports
  • You are free to use any IDE or editor, but we do require:

    • The final deliverable (submission) is an executable Quarto project, with all code cells written in R and documentation written in markdown.
    • The submitted Quarto project produces two reports: (1) a detailed analysis report and (2) a higher-level summary report.
    • Rendering the submitted Quarto project must successfully produce the two reports.
    • Clean and concise code, with a consistent coding style and use of the following packages:
      • tidyverse for data wrangling.
      • ggplot2 for visualizations.
      • sparklyr for interfacing with Spark.
    • Adequate testing of data properties, data transformations, etc.
  • Getting help:

    • Data/Stats: See the corresponding threads for HW1. If you can’t find an answer, post a general question on Ed ([HW2] Data questions thread).
    • Visualization: See the demos on the course website and the linked references. If you can’t find an answer, post a general question on Ed ([HW2] ggplot2 questions thread).
    • Quarto/Spark(lyr): See the linked references and tutorials. If you can’t find an answer, post a general question on Ed ([HW2] Infra questions thread).
    • Solution-specific questions: Send the course staff a private Ed message (HW2 category).
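To illustrate the kind of data-property testing expected, here is a minimal sketch using base R's stopifnot; the data frame df and its columns are hypothetical placeholders for your own:

```r
# Hypothetical checks on a consolidated data frame `df`;
# adapt the column names and expectations to your actual data.
stopifnot(
  nrow(df) > 0,                 # data was actually loaded
  !any(duplicated(df$id)),      # primary key is unique
  all(!is.na(df$complexity)),   # no missing complexity values
  all(df$complexity >= 0)       # complexity is non-negative
)
```

The same checks can also be written with the testthat package if you prefer named, reportable test cases.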

Background

The previous homework provided background information about the data set and domain. For simplicity, this homework focuses on the same data set and domain.

Note: This exercise aims to strike a balance between teaching big-data concepts and the overhead of setting up an actual infrastructure that would scale to big data. In a real-world scenario, your analysis pipeline will most likely interface with an existing cluster or some form of distributed data storage. That said, the concepts are the same regardless of how small or large the target data set is – thanks in part to well-designed Spark and SQL backends for powerful packages such as dplyr.

Instructions

HW1 feedback

HW1 feedback has two clearly labeled sections:

  1. Expected changes: Incorporating these changes will earn back up to 90% of the points lost on HW1.

  2. Optional changes: Incorporating these changes will strengthen your data analysis, but these changes are not strictly required for HW2.

Visualizations (35%)

  • Add at least four plots (across your two reports).

  • At least one plot must be a facet plot with an additional aesthetic encoding a different dimension of the data.

  • There is no limit on the number of plots; feel free to explore different visualizations and retain those visualizations in your detailed analysis report.

  • Make sure to properly describe all visualizations.

Note: You may provide visualizations for the consolidated data set or your analysis results of that data set. For example, you may visualize code/test/bug complexity across different dimensions, independently of your statistical analyses.
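A facet plot satisfying the requirement above might look like the following sketch; the data frame df and its columns (loc, bug_count, language, has_tests) are hypothetical stand-ins for your own data:

```r
library(ggplot2)

# Hypothetical columns: `loc` (code size), `bug_count`, `language`
# (the facet dimension), and `has_tests` mapped to an extra
# aesthetic (color) for a further dimension of the data.
ggplot(df, aes(x = loc, y = bug_count, color = has_tests)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ language) +
  labs(
    x = "Lines of code",
    y = "Number of bugs",
    color = "Has tests",
    title = "Bug count vs. code size, by language"
  )
```

Remember that every such plot must be accompanied by a description of what it shows and why it matters for your analysis.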

Distributed computing with Spark (35%)

Use the sparklyr package to interface with Spark for two purposes:

  1. Consolidate all five data files in Spark before materializing the final data frame into memory.

  2. Perform distributed computations, such as fitting multiple models on the same data, fitting the same model on different groups of the same data, or parallelizing a computation.

Note: For a more realistic exercise, you may want to work with the disaggregated data provided on the course website.
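The two purposes above can be sketched as follows with sparklyr; the file layout, column names, and the per-group model are hypothetical, and a local Spark connection stands in for a real cluster:

```r
library(sparklyr)
library(dplyr)

# A local Spark instance is sufficient for this exercise; in a real
# deployment you would pass the cluster's master URL instead.
sc <- spark_connect(master = "local")

# (1) Read each data file into Spark (not into R memory). The "data"
# directory and CSV format are assumptions; adapt to your files.
paths <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
tbls <- lapply(paths, function(p) {
  spark_read_csv(sc, name = tools::file_path_sans_ext(basename(p)), path = p)
})

# Consolidate inside Spark (a row-wise union here; use joins instead
# if the files share keys), then materialize into an R data frame.
consolidated <- Reduce(sdf_bind_rows, tbls)
df <- collect(consolidated)

# (2) Distributed computing: fit the same (hypothetical) model on each
# group of the data, with the fitting distributed across Spark workers.
fits <- consolidated %>%
  spark_apply(function(g) {
    m <- lm(bug_count ~ loc, data = g)   # hypothetical columns
    data.frame(slope = coef(m)[["loc"]])
  }, group_by = "language")

spark_disconnect(sc)
```

Note that the consolidation happens entirely inside Spark; collect() is called only once, on the final result.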

Quarto reports (30%)

Develop a Quarto project that produces two HTML reports:

  1. A detailed analysis report – this is your improved, executable notebook from HW1. Focus on how you designed and conducted your analysis and why you made those choices, in addition to providing the analysis code and results.

  2. A higher-level summary report. Focus on the key takeaways, their implications, and potential next steps.

An analogy for the difference between the two reports is a research paper vs. a conference presentation or a news article. The latter two are usually presented at a much higher level, consolidating individual analysis results and focusing on the bigger picture.

Note: While not strictly required for this exercise, it is a good practice to separate data processing and data visualization. For example, your detailed analysis may generate an aggregated data set that is then ingested and visualized by the code that creates the summary report. Feel free to modularize your analysis pipeline.
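One way to structure the two-report project is a Quarto project whose _quarto.yml lists both documents; a minimal sketch (the .qmd file names are placeholders for your own):

```yaml
project:
  type: default
  render:
    - analysis.qmd   # detailed analysis report
    - summary.qmd    # higher-level summary report

format:
  html:
    embed-resources: true
```

With this layout, a single quarto render executes both documents and produces the two required HTML reports.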

Deliverables

  1. An executable Quarto project that, if rendered, will execute your analysis and produce the two reports.

Steps for turn-in

Upload the deliverables to Canvas.