In-class exercise Big data: Instructions

High-level goal

The high-level goal of this exercise is to gain experience with high-performance computing and processing big data with R.

Note: The datasets in this exercise are small and not representative of the volume of data you will typically encounter when dealing with big data. The key idea is to focus on the conceptual bits and keep the resource requirements and analysis runtime manageable. Even if the data were orders of magnitude larger, the core concepts around compute-bound and memory-bound computations would be the same.

Setup

  • Team up in groups of size 2, and self-assign to a group (In-class-5-big-data groupset) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups found a pair-programming setup (in person, or using screen sharing for remote work) to be beneficial.

  • You may complete this exercise on your own machine, on attu, or using a cloud-based notebook. (Note: Rcpp works well on Colab, but Spark and sparklyr may require a bit more tinkering.)
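
If the required packages are not already installed, a minimal setup sketch might look like the following (the package names are assumed to be available from CRAN, and the Spark installation step is only needed if you plan to run a local standalone Spark instance):

    # Install the R packages used in this exercise (assumes a configured CRAN mirror).
    install.packages(c("Rcpp", "microbenchmark", "sparklyr", "ggplot2"))

    # For Part 2, sparklyr can download and install a local Spark distribution.
    # Skip this step if you connect to an existing Spark cluster instead.
    sparklyr::spark_install()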

Part 1: Rcpp

Instructions

  1. Write an R function meanR that computes the mean of a numeric vector by explicitly looping over the vector elements (i.e., summing all elements).

  2. Write an analogous C++ function meanCpp.

  3. Create a vector vec with 1 million entries drawn from a normal distribution with mean 0 and standard deviation 1.

  4. Make sure that meanR and meanCpp compute the expected result (compare against the expected mean value and the result of the built-in mean function).

  5. Use the microbenchmark package to compare the runtimes of meanR, meanCpp, and the built-in mean function over 100 runs, using vec as input.

  6. Visualize the results of the microbenchmarking, showing the three runtime distributions. (A starter sketch covering steps 1-6 follows this list.)
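
The following is a minimal starter sketch for steps 1-6; the variable names, the random seed, and the use of ggplot2's autoplot for the visualization are assumptions rather than requirements:

    library(Rcpp)
    library(microbenchmark)
    library(ggplot2)

    # Step 1: mean of a numeric vector via an explicit R loop.
    meanR <- function(x) {
      total <- 0
      for (i in seq_along(x)) {
        total <- total + x[i]
      }
      total / length(x)
    }

    # Step 2: analogous C++ function, compiled inline with Rcpp.
    cppFunction('
    double meanCpp(NumericVector x) {
      double total = 0;
      int n = x.size();
      for (int i = 0; i < n; ++i) {
        total += x[i];
      }
      return total / n;
    }')

    # Step 3: 1 million draws from a normal distribution with mean 0 and sd 1.
    set.seed(1)  # optional, for reproducibility
    vec <- rnorm(1e6, mean = 0, sd = 1)

    # Step 4: sanity-check both implementations against the built-in mean.
    all.equal(meanR(vec), mean(vec))
    all.equal(meanCpp(vec), mean(vec))

    # Step 5: benchmark the three implementations over 100 runs.
    bm <- microbenchmark(meanR(vec), meanCpp(vec), mean(vec), times = 100)

    # Step 6: one possible way to show the three runtime distributions.
    autoplot(bm)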

Questions

  • Q1 Briefly describe your observations and contrast these with your observations and hypotheses from in-class 1 (part 2).

  • Q2 Briefly justify your choice for the visualization in step 6.

Part 2: sparklyr

  1. Use the sparklyr package for this part of the exercise.

  2. Create a Spark session. (You may use a standalone session or connect to a running Spark cluster.)

  3. Load two data files from HW1 into your Spark session: code_stats.csv and bug_stats.csv.

  4. Consolidate the two resulting Spark data frames into a single Spark data frame with one row per bug (PID, BID).

  5. Without materializing the consolidated Spark data frame into local memory, compute the following:

    • The average number of added lines per project.
    • The number of bugs with at least one deleted line, both overall and per project.

  6. Materialize the consolidated Spark data frame into local memory, and repeat the computations from step 5.

  7. Visualize the computed results, using a plot of your choice. (A starter sketch covering these steps follows this list.)
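
The following is a hedged starter sketch for this part; the local connection, the file paths, and the column names added and deleted are assumptions based on the HW1 schema and may need to be adjusted to match your data:

    library(sparklyr)
    library(dplyr)
    library(ggplot2)

    # Steps 1-2: connect to a local standalone Spark session.
    sc <- spark_connect(master = "local")

    # Step 3: load the two HW1 data files into Spark.
    code_stats <- spark_read_csv(sc, name = "code_stats", path = "code_stats.csv")
    bug_stats  <- spark_read_csv(sc, name = "bug_stats",  path = "bug_stats.csv")

    # Step 4: consolidate into one Spark data frame with one row per bug (PID, BID).
    bugs <- inner_join(code_stats, bug_stats, by = c("PID", "BID"))

    # Step 5: summaries computed without pulling the consolidated data into local
    # memory; these pipelines are lazily translated to Spark SQL and executed on
    # the Spark side when their results are printed or collected.
    avg_added_per_project <- bugs %>%
      group_by(PID) %>%
      summarise(avg_added_lines = mean(added, na.rm = TRUE))

    bugs_with_deleted_overall <- bugs %>%
      filter(deleted >= 1) %>%
      summarise(n_bugs = n())

    bugs_with_deleted_per_project <- bugs %>%
      filter(deleted >= 1) %>%
      group_by(PID) %>%
      summarise(n_bugs = n())

    # Step 6: materialize the consolidated Spark data frame into local memory and
    # repeat the computations on the resulting local tibble (the remaining
    # summaries follow the same pattern).
    bugs_local <- collect(bugs)
    avg_added_local <- bugs_local %>%
      group_by(PID) %>%
      summarise(avg_added_lines = mean(added, na.rm = TRUE))

    # Step 7: one possible visualization of the per-project averages.
    ggplot(avg_added_local, aes(x = PID, y = avg_added_lines)) +
      geom_col()

    spark_disconnect(sc)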

Questions

  • Q3 In your own words, explain how lazy evaluation in sparklyr works.

  • Q4 In your own words, explain what a tbl_spark represents.

  • Q5 In your own words, explain the difference between compute() and collect() when materializing a tbl_spark.

Deliverables

  1. An executable Quarto project (.qmd). Please list all group members at the top of your submission.

Steps for turn-in

One team member should upload the deliverables to Canvas.