In-class exercise Statistical Significance and Power: Instructions

High-level goal

The high-level goal of this exercise is twofold: (1) deepen your understanding of parametric and non-parametric statistics and (2) deepen your understanding of statistical significance tests and effect sizes.

Set up

  • Team up in groups of size 2, and self-assign to a group (In-class-4-stats-nhst) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups found a pair-programming setup (in person or using screen-sharing for remote work) to be beneficial.

  • Set up R: Any one of the following options is sufficient for this exercise:

    1. Local: Install R on your machine.
    2. Local: Install R and RStudio (an IDE for R) on your machine.
    3. Web-based: Use a computational notebook with an R kernel, such as Jupyter or Google Colab.
    4. Web-based: Use the WebR REPL (convenient but still early release).

    The (free) web-based options offer a very convenient way of exploring R. They are sufficient for most exercises, but you may need additional resources for some of the big-data analyses towards the end of the course.

Part 1: Parametric vs. non-parametric statistics

Instructions

1. Install (if needed) and load the following packages

suppressPackageStartupMessages({
  library(tidyverse)
  library(assertthat)
  library(effsize)
})
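
If any of these packages are not yet installed on your machine, a one-time installation should suffice (package names as listed above):

# Install any missing packages (one-time setup)
install.packages(c("tidyverse", "assertthat", "effsize"))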

2. Load the runtime.csv dataset from the course website

rt_data <- read_csv("https://homes.cs.washington.edu/~rjust/courses/CSEP590/in_class/04_stats/data/runtime.csv", show_col_types=F) 

For faster (local) exploration you may download the dataset and load it from a local file. However, make sure that your final submission reads the data from the given URL.

3. Inspect the data set

head(rt_data)

This dataset provides benchmark results (runtime data) for a new program analysis approach (MySystem), compared to a baseline approach (Baseline). Specifically, the columns provide the following information:

  • Subject: One of three benchmark programs (tax, tictactoe, triangle).
  • VariantID: ID of a program variant. Each subject program has a different number of variants. For a given subject program, all variants are enumerated, starting with an ID of 1. The expected number of variants per subject is as follows:
    • tax: 99
    • tictactoe: 268
    • triangle: 122
  • RunID: Each program variant was analyzed 5 times to account for variability in runtime measurements.
  • Baseline: The runtime of the baseline system.
  • MySystem: The runtime of the new system.

Additional data expectations: all runtimes are strictly positive, and the data set is complete.

4. Validate the data set

Given the summary above, test for 3 expected properties of the data set, not counting the example assertion on the number of subject programs (see Q1).

(Optional: Thoroughly validate the data set beyond 3 expected properties.)

Note: If your validation reveals any failing assertions, (1) ask the course staff whether these are expected and (2) comment out the assertion and move on.

# Count unique subject names
nSubj <- length(unique(rt_data$Subject))
assert_that(3 == nSubj)
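
If you are unsure where to start, the assertions below sketch a few possible checks based on the stated expectations (variant counts, strictly positive runtimes, completeness); your own choice of 3 assertions may differ.

# Possible additional checks (a sketch only; your assertions may differ)
# Runtimes are strictly positive
assert_that(all(rt_data$Baseline > 0), all(rt_data$MySystem > 0))
# Each (Subject, VariantID) pair appears exactly 5 times
assert_that(all(count(rt_data, Subject, VariantID)$n == 5))
# Expected total number of rows: (99 + 268 + 122) variants x 5 runs
assert_that(nrow(rt_data) == (99 + 268 + 122) * 5)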

5. Transform the data from wide to long format

The output data frame should have the following columns (the order does not matter):

  • Subject
  • VariantID
  • RunID
  • Approach
  • Runtime
rt_data.long <- 
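
If you get stuck, one possible sketch (using tidyr's pivot_longer; your solution may differ):

# Sketch: turn the Baseline/MySystem columns into (Approach, Runtime) pairs
rt_data.long <- rt_data %>%
  pivot_longer(cols = c(Baseline, MySystem),
               names_to = "Approach", values_to = "Runtime")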

6. Aggregate runtime data

Recall that each variant was analyzed 5 times (i.e., a deterministic program was executed 5 times on the same variant with identical inputs). Aggregate each set of 5 related runtime results, using either the mean or the median. Provide a brief justification for your choice of mean vs. median (see Q2). (Your choice may be informed by data or domain knowledge.)

rt_data.agg <-
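
One possible sketch, assuming the median is your (justified) choice; substitute mean accordingly:

# Sketch: collapse the 5 runs per (Subject, VariantID, Approach) into one value
rt_data.agg <- rt_data.long %>%
  group_by(Subject, VariantID, Approach) %>%
  summarize(Runtime = median(Runtime), .groups = "drop")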

7. Validate aggregation

assert_that(nrow(rt_data.agg) == nrow(rt_data.long)/5)

(Optional: Add additional assertions for data validation.)
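
For example, one additional (optional) check could assert that aggregation preserved strictly positive runtimes:

# Optional sketch: aggregated runtimes should still be strictly positive
assert_that(all(rt_data.agg$Runtime > 0))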

8. Plot the aggregated data, using color coding and faceting

ggplot(rt_data.agg) +
    geom_density(aes(x=Runtime, color=Approach)) +
    facet_grid(Subject~.) +
    theme_bw() + theme(legend.position="top")

Read the syntax for facet_grid as: group the data by Subject and plot each subject on a separate row. More generally, facet_grid allows you to group your data and plot these groups individually by rows or columns (Syntax: <rows> ~ <cols>). For example, the following four configurations group and render the same data in different ways:

  • facet_grid(Subject~.)
  • facet_grid(Subject~Approach)
  • facet_grid(.~Subject)
  • facet_grid(.~Subject+Approach)

A future lecture will discuss best practices for choosing a suitable visualization, depending on the underlying data and research questions.

9. Add a column for transformed data

It is reasonable to assume that the runtime data is log-normally distributed. Add a column RuntimeLog that simply takes the log of the Runtime column.

rt_data.agg <-
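
A minimal sketch using mutate (the natural log is used here; any base works as long as you are consistent):

# Sketch: add the log-transformed runtime
rt_data.agg <- rt_data.agg %>% mutate(RuntimeLog = log(Runtime))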

10. Plot transformed data

ggplot(rt_data.agg) +
    geom_density(aes(x=RuntimeLog, color=Approach)) +
    facet_grid(Subject~.) +
    theme_bw() + theme(legend.position="top")

11. Test the difference(s) – Runtime using the full data set

t <- t.test(Runtime~Approach, rt_data.agg)
d <- cohen.d(Runtime~Approach, rt_data.agg)
t.res <- tibble(subj="all", data="Linear", test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude)
 
u <- wilcox.test(Runtime~Approach, rt_data.agg)
a <- VD.A(Runtime~Approach, rt_data.agg)
u.res <- tibble(subj="all", data="Linear", test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude)

results <- bind_rows(t.res, u.res)
results

12. Test the difference(s) – Runtime vs. RuntimeLog and per subject

Extend the code above (and the results data frame): add test results for all combinations of Subject x {Runtime, RuntimeLog} x {t.test, wilcox.test}. The final results data frame should contain 16 rows: the results for each subject as well as for all subjects combined (see Q3 and Q4).

Note: You are not graded on coding style or code efficiency. However, try to be as concise as possible.
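
If you are looking for a concise structure, one possible pattern is a small helper that runs both tests for a given data slice and column; the helper name (run_tests) and its arguments below are illustrative only, not part of the expected solution:

# Sketch: one possible pattern (run_tests is an illustrative helper, not required)
run_tests <- function(df, subj_label, col, data_label) {
  fml <- as.formula(paste(col, "~ Approach"))
  t <- t.test(fml, df)
  d <- cohen.d(fml, df)
  u <- wilcox.test(fml, df)
  a <- VD.A(fml, df)
  bind_rows(
    tibble(subj=subj_label, data=data_label, test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude),
    tibble(subj=subj_label, data=data_label, test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude))
}
# Example call for one combination; iterate over the subjects ("all" plus the three
# subject-specific slices) and over c("Runtime", "RuntimeLog") to build all 16 rows.
run_tests(filter(rt_data.agg, Subject == "tax"), "tax", "RuntimeLog", "Log")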

# Add additional rows to the results data frame

# Test for completeness
assert_that(nrow(results) == 16)

# Print the final results data frame
results

Part 2: General properties of the U test

Note: This part is independent of part 1 and not related to the runtime data set. In particular, the independent samples in questions Q5 and Q6 are samples that you make up (encoded manually or simulated from a common distribution) such that they satisfy the stated properties.
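
For reference, the basic pattern for both questions is to define two vectors and compare them with wilcox.test; the values below are placeholders only, not an answer to Q5 or Q6 (this assumes the packages from Part 1 are still loaded):

# Sketch: compare two made-up samples with the U test (placeholder values only)
A <- c(1.2, 3.4, 2.2, 0.8)   # replace with your own sample
B <- c(2.1, 4.0, 1.8, 3.3)   # replace with your own sample
wilcox.test(A, B)
# Simple point plot of the two samples
ggplot(tibble(value = c(A, B),
              sample = rep(c("A", "B"), times = c(length(A), length(B))))) +
  geom_point(aes(x = sample, y = value)) + theme_bw()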

13. Code for questions Q5 and Q6

Supporting code for Q5

# Create two samples A and B

Supporting code for Q6

# Create two samples A and B

Questions (5 pts)

  • Q1 Briefly justify your choice of data validation assertions: what informed your choices? (0.5 pts)

  • Q2 Briefly justify your choice for aggregating the runtime data. (0.5 pts)

  • Q3 How did the data transformation of the aggregated Runtime values as well as the slicing by Subject affect the outcomes of the parametric and non-parametric tests (T vs. U)? Briefly explain your observations (considering differences in p values and effect sizes). (1 pt)

  • Q4 Given your understanding of the data-generation process and your observations about the data, indicate and justify which data analysis is preferable. (Consider possible decisions such as all subjects vs. per subject, transformed vs. non-transformed data, and parametric vs. non-parametric statistics.) (1 pt)

  • Q5 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is truly minimal when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot. (0.5 pts)

  • Q6 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice). (1 pt)

  • Q7 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the median? (0.5 pts)

  • Q8 (Optional) Additional validation efforts. (up to 0.5 pts)

Deliverables (5 pts)

  1. An executable notebook with all your code and answers to the questions above. Acceptable notebook formats are:
    • Rmarkdown: .Rmd
    • Jupyter: .ipynb
    Please list all group members at the top of your notebook (or just your name if you worked alone).

You may use the starter code linked from the table of contents. Note: the .ipynb notebook is automatically converted from the .Rmd notebook (using Quarto).

Steps for turn-in

One team member should upload the deliverables to Canvas.