
In-class exercise: Statistical Significance and Power (Solutions)

Author

Name1, Name2

Part 1: Parametric vs. non-parametric statistics

Instructions

1. Install (if needed) and load the following packages

In [1]:
suppressPackageStartupMessages({
  library(tidyverse)
  library(assertthat)
  library(effsize)
})

2. Load the runtime.csv dataset from the course website

In [2]:
rt_data <- read_csv("https://homes.cs.washington.edu/~rjust/courses/CSEP590/in_class/04_stats/data/runtime.csv", show_col_types = FALSE)

For faster (local) exploration, you may download the dataset and load it from a local file. However, make sure that your final submission reads the data from the given URL.

3. Inspect the data set

In [3]:
head(rt_data)

This dataset provides benchmark results (runtime data) for a new program analysis approach (MySystem), compared to a baseline approach (Baseline). Specifically, the columns provide the following information:

  • Subject: One of three benchmark programs (tax, tictactoe, triangle).
  • VariantID: ID of a program variant. Each subject program has a different number of variants. For a given subject program, all variants are enumerated, starting with an ID of 1. The expected number of variants per subject is as follows:
    • tax: 99
    • tictactoe: 268
    • triangle: 122
  • RunID: Each program variant was analyzed 5 times to account for variability in runtime measurements.
  • Baseline: The runtime of the baseline system.
  • MySystem: The runtime of the new system.

Additional data expectations: runtime is strictly positive and the data set is complete.

4. Validate the data set

Given the summary above, test for 3 expected properties of the data set, not counting the example assertion on the number of subject programs (see Q1).

(Optional: Thoroughly validate the data set beyond 3 expected properties.)

Note: If your validation reveals any failing assertions (1) ask the course staff whether these are expected and (2) comment out the assertion and move on.

In [4]:
# Count unique subject names
nSubj <- length(unique(rt_data$Subject))
assert_that(3 == nSubj)
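
For example, the following sketch checks the three stated expectations (variant counts per subject, strictly positive runtimes, and completeness); the helper vector expected is introduced here for illustration:

# Expected variant counts per subject (from the data description above)
expected <- c(tax = 99, tictactoe = 268, triangle = 122)

# Property 1: each subject has the expected number of variants
variant_counts <- rt_data %>%
  group_by(Subject) %>%
  summarize(n = n_distinct(VariantID))
assert_that(all(variant_counts$n == expected[variant_counts$Subject]))

# Property 2: runtimes are strictly positive
assert_that(all(rt_data$Baseline > 0), all(rt_data$MySystem > 0))

# Property 3: the data set is complete (5 runs per variant, no missing values)
assert_that(nrow(rt_data) == 5 * sum(expected))
assert_that(!any(is.na(rt_data)))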

5. Transform the data from wide to long format

The output data frame should have the following columns (the order does not matter):

  • Subject
  • VariantID
  • RunID
  • Approach
  • Runtime
In [5]:
rt_data.long <- rt_data %>%
  pivot_longer(c(Baseline, MySystem), names_to = "Approach", values_to = "Runtime")

6. Aggregate runtime data

Recall that each variant was analyzed 5 times (i.e., a deterministic program was executed 5 times on the same variant with identical inputs). Aggregate each set of 5 related runtime results, using the mean or the median. Provide a brief justification for your choice of mean vs. median (see Q2). (Your choice may be informed by data or domain knowledge.)

In [6]:
rt_data.agg <- rt_data.long %>% group_by(Subject, VariantID, Approach) %>%
  summarize(Runtime = median(Runtime), .groups = "drop")  # one reasonable choice: median is robust to outliers (see Q2)

7. Validate aggregation

In [7]:
assert_that(nrow(rt_data.agg) == nrow(rt_data.long)/5)

(Optional: Add additional assertions for data validation.)
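
For example, one additional check (a sketch): after aggregation, each Subject/VariantID/Approach combination should appear exactly once.

assert_that(nrow(rt_data.agg) ==
            nrow(distinct(rt_data.agg, Subject, VariantID, Approach)))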

8. Plot the aggregated data, using color coding and faceting

In [8]:
ggplot(rt_data.agg) +
    geom_density(aes(x=Runtime, color=Approach)) +
    facet_grid(Subject~.) +
    theme_bw() + theme(legend.position="top")

Read the syntax for facet_grid as: group the data by Subject and plot each subject on a separate row. More generally, facet_grid allows you to group your data and plot these groups individually by rows or columns (Syntax: <rows> ~ <cols>). For example, the following four configurations group and render the same data in different ways:

  • facet_grid(Subject~.)
  • facet_grid(Subject~Approach)
  • facet_grid(.~Subject)
  • facet_grid(.~Subject+Approach)
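
For instance, the last configuration renders one column per Subject/Approach combination; swapping it into the plot above is a quick way to compare the layouts:

ggplot(rt_data.agg) +
    geom_density(aes(x=Runtime, color=Approach)) +
    facet_grid(.~Subject+Approach) +
    theme_bw() + theme(legend.position="top")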

A future lecture will discuss best practices for choosing a suitable visualization, depending on the underlying data and research questions.

9. Add a column for transformed data

It is reasonable to assume that the runtime data is log-normally distributed. Add a column RuntimeLog that contains the log of the Runtime column.

In [9]:
rt_data.agg <- rt_data.agg %>% mutate(RuntimeLog = log(Runtime))

10. Plot transformed data

In [10]:
ggplot(rt_data.agg) +
    geom_density(aes(x=RuntimeLog, color=Approach)) +
    facet_grid(Subject~.) +
    theme_bw() + theme(legend.position="top")

11. Test the difference(s) – Runtime using the full data set

In [11]:
# Parametric: t test with Cohen's d effect size
t <- t.test(Runtime ~ Approach, rt_data.agg)
d <- cohen.d(Runtime ~ Approach, rt_data.agg)
t.res <- tibble(subj="all", data="Linear", test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude)

# Non-parametric: U test (Wilcoxon rank-sum) with Vargha-Delaney A effect size
u <- wilcox.test(Runtime ~ Approach, rt_data.agg)
a <- VD.A(Runtime ~ Approach, rt_data.agg)
u.res <- tibble(subj="all", data="Linear", test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude)

results <- bind_rows(t.res, u.res)
results

12. Test the difference(s) – Runtime vs. RuntimeLog and per subject

Extend the code above (and the results data frame): add test results for all combinations of Subject x {Runtime, RuntimeLog} x {t.test, wilcox.test}. The final results data frame should provide 16 rows: 4 subject groups (each of the three subjects plus "all") x 2 data representations x 2 tests (see Q3 and Q4).

Note: You are not graded on coding style or code efficiency. However, try to be as concise as possible.

In [12]:
# Add additional rows to the results data frame
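
# One possible approach (a sketch): loop over all subject groups ("all" plus
# each individual subject) and both data representations, and run both tests.
# The label "Log" for the log-transformed results is a naming choice.
for (s in c("all", unique(rt_data.agg$Subject))) {
  dat <- if (s == "all") rt_data.agg else filter(rt_data.agg, Subject == s)
  for (col in c("Runtime", "RuntimeLog")) {
    if (s == "all" && col == "Runtime") next  # already computed above
    f <- as.formula(paste(col, "~ Approach"))
    lbl <- ifelse(col == "Runtime", "Linear", "Log")
    t <- t.test(f, dat); d <- cohen.d(f, dat)
    u <- wilcox.test(f, dat); a <- VD.A(f, dat)
    results <- bind_rows(results,
      tibble(subj=s, data=lbl, test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude),
      tibble(subj=s, data=lbl, test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude))
  }
}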

# Test for completeness
assert_that(nrow(results) == 16)

# Print the final results data frame
results

Part 2: General properties of the U test

Note: This part is independent of part 1 and not related to the runtime data set. In particular, the independent samples in questions Q5 and Q6 are samples that you can make up (encoded manually or simulated using a common distribution) such that they satisfy the stated properties.

13. Code for questions Q5 and Q6

Supporting code for Q5

In [13]:
# Create two samples A and B
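
# A minimal sketch: with five observations per sample, the smallest possible
# two-sided p value is achieved when the two samples do not overlap at all.
A <- c(1, 2, 3, 4, 5)
B <- c(6, 7, 8, 9, 10)
wilcox.test(A, B)  # exact two-sided p = 2/choose(10,5), i.e., ~0.0079

# Point plot of the two samples
q5_data <- tibble(value = c(A, B), sample = rep(c("A", "B"), each = 5))
ggplot(q5_data, aes(x = sample, y = value)) + geom_point() + theme_bw()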

Supporting code for Q6

In [14]:
# Create two samples A and B
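
# A sketch: pin both sample medians to exactly 5, but give B a low half that
# sits just below 5 (above A's low half) and a far-out high tail, so B's
# values systematically out-rank A's despite the identical medians.
A <- c(seq(0.1, 0.9, by = 0.1), 5, seq(5.1, 5.9, by = 0.1))  # median = 5
B <- c(seq(4.1, 4.9, by = 0.1), 5, seq(9.1, 9.9, by = 0.1))  # median = 5
assert_that(median(A) == median(B))
wilcox.test(A, B)  # expected p ~ 0.02 (normal approximation due to the tied 5s); verify when running

# Point plot of the two samples
q6_data <- tibble(value = c(A, B), sample = rep(c("A", "B"), each = length(A)))
ggplot(q6_data, aes(x = sample, y = value)) + geom_point() + theme_bw()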

Questions (5 pts)

  • Q1 Briefly justify your choice of data validation assertions: what informed your choices? (0.5 pts)

  • Q2 Briefly justify your choice for aggregating the runtime data. (0.5 pts)

  • Q3 How did the data transformation of the aggregated Runtime values as well as the slicing by Subject affect the outcomes of the parametric and non-parametric tests (T vs. U)? Briefly explain your observations (considering differences in p values and effect sizes). (1 pt)

  • Q4 Given your understanding of the data-generation process and your observations about the data, indicate and justify which data analysis is preferable. (Consider possible decisions such as all subjects vs. per subject, transformed vs. non-transformed data, and parametric vs. non-parametric statistics.) (1 pt)

  • Q5 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is truly minimal when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot. (0.5 pts)

  • Q6 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice). (1 pt)

  • Q7 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the median? (0.5 pts)

  • Q8 (Optional) Additional validation efforts. (up to 0.5 pts)