In-class exercise Big data: Instructions
High-level goal
The high-level goal of this exercise is to gain experience with high-performance computing and processing big data with R.
Note: The datasets in this exercise are small and not representative of the volume of data you will typically encounter when dealing with big data. The key idea is to focus on the conceptual bits and keep the resource requirements and analysis runtime manageable. Even if the data were orders of magnitude larger, the core concepts around compute-bound and memory-bound computations would be the same.
Set up
Team up in groups of size 2, and self-assign to a group (In-class-5-big-data groupset) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups have found a pair-programming setup (in person, or using screen sharing for remote work) to be beneficial.
You may complete this exercise on your own machine, on attu, or using a cloud-based notebook. (Note: Rcpp works well on Colab, but Spark and sparklyr may require a bit more tinkering.)
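If you are working on your own machine, the sketch below shows one possible one-time setup. The Spark version passed to spark_install() is an assumption; check sparklyr::spark_available_versions() for what is available in your environment. Rcpp additionally needs a working C++ toolchain (e.g., Rtools on Windows or the Xcode command-line tools on macOS).

```r
# One-time setup (sketch): install the R packages used in this exercise.
install.packages(c("Rcpp", "microbenchmark", "sparklyr", "dplyr", "ggplot2"))

# For a standalone local Spark session, sparklyr can download a Spark
# distribution. The version below is an assumption -- pick one listed by
# sparklyr::spark_available_versions().
sparklyr::spark_install(version = "3.5")
```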
Part 1: Rcpp
Instructions
1. Write an R function meanR that computes the mean of a numeric vector by explicitly looping over the vector elements (i.e., summing all elements).
2. Write an analogous C++ function meanCpp.
3. Create a vector vec with 1 million entries such that the entries are normally distributed with mean 0 and standard deviation of 1.
4. Make sure that meanR and meanCpp compute the expected result (compare against the expected mean value and the result of the built-in mean function).
5. Use the microbenchmark package and compare the runtimes of meanR, meanCpp, and the built-in mean function – over 100 runs, using vec as input.
6. Visualize the results of the microbenchmarking, showing the three distributions.
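The following is a minimal sketch of one way to work through steps 1–6, not a prescribed solution; it assumes the Rcpp, microbenchmark, and ggplot2 packages are installed.

```r
library(Rcpp)
library(microbenchmark)
library(ggplot2)

# Step 1: mean via an explicit R loop.
meanR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total / length(x)
}

# Step 2: analogous C++ loop, compiled and exposed to R via Rcpp.
cppFunction('
double meanCpp(NumericVector x) {
  double total = 0;
  int n = x.size();
  for (int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total / n;
}')

# Step 3: 1 million standard-normal draws.
vec <- rnorm(1e6, mean = 0, sd = 1)

# Step 4: sanity checks against the built-in mean().
all.equal(meanR(vec), mean(vec))
all.equal(meanCpp(vec), mean(vec))

# Steps 5-6: benchmark over 100 runs and plot the three runtime distributions.
mb <- microbenchmark(meanR(vec), meanCpp(vec), mean(vec), times = 100)
autoplot(mb)
```

autoplot() is only one option for step 6; any plot that shows the three runtime distributions works.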
Questions
Q1 Briefly describe your observations and contrast these with your observations and hypotheses from in-class 1 (part 2).
Q2 Briefly justify your choice for the visualization in step 6.
Part 2: sparklyr
7. Use the sparklyr package for this part of the exercise.
8. Create a Spark session. (You may use a standalone session or connect to a running Spark cluster.)
9. Load two data files from HW1 into your Spark session: code_stats.csv and bug_stats.csv.
10. Consolidate the resulting two Spark data frames such that the consolidated Spark data frame has one row per bug (PID, BID).
11. Without materializing the consolidated Spark data frame into local memory, compute the following:
    - The average number of added lines per project.
    - The number of bugs with at least 1 deleted line – overall and per project.
12. Materialize the consolidated Spark data frame into local memory, and repeat the computations from step 11.
13. Visualize the computed results, using a plot of your choice.
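A minimal sketch of one way to approach steps 8–12 follows; it is not a prescribed solution. The file paths and the column names used below (pid, bid, added_lines, deleted_lines) are assumptions – adapt them to the actual HW1 schema and your local paths.

```r
library(sparklyr)
library(dplyr)

# Step 8: standalone local Spark session.
sc <- spark_connect(master = "local")

# Step 9: load the two HW1 CSV files into Spark (paths are placeholders).
code_stats <- spark_read_csv(sc, name = "code_stats", path = "code_stats.csv")
bug_stats  <- spark_read_csv(sc, name = "bug_stats",  path = "bug_stats.csv")

# Step 10: one row per bug; the join keys are assumed to be pid and bid.
bugs <- inner_join(code_stats, bug_stats, by = c("pid", "bid"))

# Step 11: expressed on the Spark side; no rows are pulled into R here.
avg_added_per_project <- bugs %>%
  group_by(pid) %>%
  summarise(avg_added_lines = mean(added_lines, na.rm = TRUE))

bugs_with_deletions_overall <- bugs %>%
  filter(deleted_lines >= 1) %>%
  count()

bugs_with_deletions_per_project <- bugs %>%
  filter(deleted_lines >= 1) %>%
  group_by(pid) %>%
  count()

# Step 12: materialize into a local tibble and repeat with plain dplyr.
bugs_local <- collect(bugs)
avg_added_per_project_local <- bugs_local %>%
  group_by(pid) %>%
  summarise(avg_added_lines = mean(added_lines, na.rm = TRUE))
```

Note that the step 11 results above are themselves lazy tbl_spark objects; printing them (or collecting them for plotting in step 13) is what triggers execution on the cluster.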
Questions
Q3 In your own words, explain how lazy evaluation in sparklyr works.
Q4 In your own words, explain what a tbl_spark represents.
Q5 In your own words, explain the difference between compute() and collect() when materializing a tbl_spark.
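If you want to experiment while formulating your answers, the following sketch (reusing the hypothetical bugs object from the Part 2 sketch above) contrasts the two materialization paths; it is intended as a starting point for your own exploration, not as an answer.

```r
# A dplyr pipeline on a tbl_spark only builds up a query; nothing runs yet.
lazy_query <- bugs %>% filter(deleted_lines >= 1)

# compute() runs the query and stores the result as a temporary Spark table,
# returning a new tbl_spark that still lives on the Spark side.
cached_tbl <- compute(lazy_query, name = "bugs_with_deletions")

# collect() runs the query and pulls the resulting rows into a local R tibble.
local_tbl <- collect(lazy_query)
```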
Deliverables
- An executable Quarto project (.qmd). Please list all group members at the top of your submission.
Steps for turn-in
One team member should upload the deliverables to Canvas.