In-class exercise R Basics: Instructions

High-level goal

The high-level goal of this exercise is to gain experience with R, with a hands-on approach to exploring an existing data set.

Set up

Team up in groups of two, and self-assign to a group (In-class-1-R groupset) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups have found a pair-programming setup (in person, or via screen sharing for remote work) to be beneficial.
Set up R. Any one of the following options is sufficient for this exercise:
1. Local: Install R on your machine.
2. Local: Install RStudio on your machine (RStudio also requires an R installation).
3. Web-based: Use a computational notebook such as Jupyter or Google Colab.
4. Web-based: Use the WebR REPL (convenient, but still an early release).
The (free) web-based options offer a very convenient way of exploring R. They are sufficient for most exercises, but you may need additional resources for some of the big-data analyses towards the end of the course.
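
Whichever option you choose, a quick sanity check can confirm that the R runtime works before you start. The following is a minimal sketch that uses only base R, so it should run the same way in a local installation or a web-based environment:

```r
# Print the R version string to confirm which release you are running.
print(R.version.string)

# Evaluate a trivial expression to confirm the interpreter works.
stopifnot(1 + 1 == 2)

# List platform details and attached packages (useful when asking for help).
print(sessionInfo())
```

Including the output of sessionInfo() when you ask the course staff for help makes setup problems easier to diagnose.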

Part 1: The basics

Instructions

Start an interactive R runtime (R, RStudio, or a computational notebook instance).
Install the following packages, if needed:

install.packages(c("tidyverse", "nycflights23"))
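
The "if needed" part can be automated, so that re-running your script does not reinstall packages that are already present. The following is a minimal sketch; missing_packages is a hypothetical helper defined here for illustration, not a function from any package:

```r
# Hypothetical helper: returns the subset of `pkgs` that is not installed.
# requireNamespace() returns TRUE if the package can be loaded, else FALSE.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Base packages are always installed, so nothing is reported as missing here.
missing_packages(c("stats", "utils"))

# Usage for this exercise (commented out so the sketch has no side effects):
# to_install <- missing_packages(c("tidyverse", "nycflights23"))
# if (length(to_install) > 0) install.packages(to_install)
```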

Load the required libraries (packages):

library(tidyverse)
library(nycflights23)

Explore the flights data set, provided by the nycflights23 package:

# All column names of the data frame
colnames(flights)

# Structure and example data
str(flights)

# First n rows
head(flights)

# Last n rows
tail(flights)

# Summary statistics for all columns
summary(flights)

Create a first (simple) visualization:

ggplot(flights) +
  geom_boxplot(aes(x=origin, y=distance)) +
  theme_bw()

Questions

Q1 Look at the documentation for the head function (?head in an interactive environment). How many rows does head print by default? What programming language feature is being used to achieve this?
Q2 Provide two syntactically different calls of the head function that both result in the same output of the first 3 rows of the flights data frame. Which syntax is preferable (to you) and why?
Q3 Given the output of summary(flights), what do you observe in terms of data types and descriptive statistics? Compare the summary statistics for carrier and time_hour: which is more useful? How would you improve the usefulness of the output (no need to implement your solution)?
Q4 Change the plot to show the distribution of distance per month (i.e., x=month). Look at the warning message: what is the problem and what is the solution?
Q5 Augment the plot from the previous question: add fill=origin to the aesthetic mapping (i.e., the aes() function). What do you observe in the resulting plot?

Part 2: (Micro)benchmarking in R

Install the microbenchmark package, if needed.
Load the microbenchmark package.
Complete the following function to compute the sum of values in a given array:

mySum <- function(arr) {

  for (element in arr) {

  }

}

Ask the course staff for help if you get stuck.

Test your function mySum:

mySum(1:10)
mySum(c(1, 2, 3)) # Concatenate individual elements.
mySum(seq(from=0, to=100, by=10)) # Create a custom sequence.
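
Rather than checking the printed results by eye, you can compare your function against R's built-in sum on several inputs. The following is a minimal sketch; check_sum_impl is a hypothetical helper defined here for illustration, and the built-in sum is passed as a stand-in for your own mySum:

```r
# Hypothetical helper: returns TRUE if f agrees with base R's sum()
# on all test inputs, and FALSE otherwise.
check_sum_impl <- function(f) {
  inputs <- list(
    1:10,
    c(1, 2, 3),
    seq(from = 0, to = 100, by = 10),
    c(-1.5, 2.5)
  )
  results <- vapply(
    inputs,
    # all.equal() tolerates tiny floating-point differences.
    function(x) isTRUE(all.equal(f(x), sum(x))),
    logical(1)
  )
  all(results)
}

# Stand-in call; replace `sum` with `mySum` once your function is complete.
check_sum_impl(sum)
```

An incomplete or incorrect implementation makes the helper return FALSE instead of stopping with an error.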

Microbenchmark your sum function against R’s:

# Set the random seed for reproducibility.
set.seed(0)

# Ordered vector
myVec <- 1:10000
head(myVec)

# Randomized vector
rndVec <- sample(myVec)

# Benchmark our implementation vs. R's
m <- microbenchmark(
  mySum(rndVec),
  sum(rndVec)
)
summary(m)
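
If the microbenchmark package is unavailable in your environment, base R's system.time can serve as a rough fallback. The following is a minimal sketch; time_call is a hypothetical helper defined here for illustration, and sort is used only as an example function to time:

```r
# Hypothetical helper: median elapsed time (in seconds) of f(x) over `reps` runs.
time_call <- function(f, x, reps = 20) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(f(x))[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Same setup as above: a shuffled vector with a fixed seed.
set.seed(0)
rndVec <- sample(1:10000)

# Example: time base R's sort() on the shuffled vector.
time_call(sort, rndVec)
```

The timer behind system.time has a coarse resolution, so very fast calls may report an elapsed time of 0. For such calls, microbenchmark remains the better tool.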

Questions

Q6 What is the running-time complexity of mySum, as a function of the input length?
Q7 What did you observe when benchmarking mySum against R’s sum? Provide two plausible explanations for your observations. (Hypothesize different reasons and confirm them with online documentation or the course staff.)
Q8 Given your observations, what are the implications for writing efficient analyses in R?

Part 3: Exploratory data analysis

Instructions

Brainstorm three questions or hypotheses (with your team member and/or course staff) about the nycflights23 data set. Answering each question should require data-slicing operations and/or involve a comparison between two or more groups.
(For example, answering “how many flights are there in total” or “what is the average arrival delay across all flights” is insufficient.)
Answer each of your three questions by analyzing the nycflights23 data set, using the tidyverse packages for analysis. If you are new to R, focus on the following dplyr functions: count, select, filter, group_by, and summarize.
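
If these dplyr functions are new to you, the following minimal sketch shows how they combine into a pipeline. It uses the built-in mtcars data set as a stand-in for flights (an assumption made here so that the example runs without the nycflights23 package); the column names cyl, gear, hp, and mpg belong to mtcars, not to flights:

```r
library(dplyr)

# Slice the data, then compare groups:
# keep cars with more than 100 horsepower, keep three columns,
# and compute the group size and mean fuel economy per cylinder count.
by_cyl <- mtcars %>%
  filter(hp > 100) %>%
  select(cyl, mpg, hp) %>%
  group_by(cyl) %>%
  summarize(n_cars = n(), mean_mpg = mean(mpg), .groups = "drop")

print(by_cyl)

# count() is a shortcut for group_by() followed by summarize(n = n()).
mtcars %>% count(cyl, gear)
```

The same pattern applies to flights: filter selects rows, select picks columns, and group_by with summarize produces one row per group.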

Questions

Q9 List your three questions and corresponding answers, together with your analysis code and outputs.
Q10 Optional (advanced): provide a visualization for at least two of your answers.
Q11 Optional (going deep): Explain the difference between the two pipes |> and %>%. Describe an example use case where one is preferable to the other.

Deliverables

A plain-text (or PDF) file with your answers to the 9 required questions above (Q1 to Q9), plus any optional questions you attempted. Please list all group members at the top of your submission.

Steps for turn-in

One team member should upload the deliverables to Canvas.