suppressPackageStartupMessages({
library(tidyverse)
library(assertthat)
library(effsize)
})
In-class exercise: Statistical Significance and Power – Solutions
Part 1: Parametric vs. non-parametric statistics
Instructions
1. Install (if needed) and load the following packages
In [1]:
2. Load the runtime.csv
dataset from the course website
In [2]:
rt_data <- read_csv("https://homes.cs.washington.edu/~rjust/courses/CSEP590/in_class/04_stats/data/runtime.csv", show_col_types=F)
For faster (local) exploration you may download the dataset and load it from a local file. However, make sure that your final submission reads the data from the given URL.
3. Inspect the data set
In [3]:
head(rt_data)
This dataset provides benchmark results (runtime data) for a new program analysis approach (MySystem), compared to a baseline approach (Baseline). Specifically, the columns provide the following information:
- Subject: One of three benchmark programs (tax, tictactoe, triangle).
- VariantID: ID of a program variant. Each subject program has a different number of variants. For a given subject program, all variants are enumerated, starting with an ID of 1. The expected numbers of variants are as follows:
  - tax: 99
  - tictactoe: 268
  - triangle: 122
- RunID: Each program variant was analyzed 5 times to account for variability in runtime measurements.
- Baseline: The runtime of the baseline system.
- MySystem: The runtime of the new system.
Additional data expectations: runtime is strictly positive and the data set is complete.
4. Validate the data set
Given the summary above, test for 3 expected properties of the data set, not counting the example assertion on number of subject programs (see Q1).
(Optional: Thoroughly validate the data set beyond 3 expected properties.)
Note: If your validation reveals any failing assertions (1) ask the course staff whether these are expected and (2) comment out the assertion and move on.
In [4]:
# Count unique subject names
nSubj <- length(unique(rt_data$Subject))
assert_that(3 == nSubj)
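Three further assertions that follow from the data description above could, for example, check the expected variant counts, the five runs per variant, and the positivity/completeness expectations. This is only a sketch, assuming the column names from step 3:

```r
suppressPackageStartupMessages({
  library(tidyverse)
  library(assertthat)
})

# Expected variant counts per subject (from the data description).
expected <- c(tax=99, tictactoe=268, triangle=122)
counts <- rt_data %>% group_by(Subject) %>% summarize(n = n_distinct(VariantID))
assert_that(all(counts$n == expected[counts$Subject]))

# Each (Subject, VariantID) pair was measured exactly 5 times.
runs <- rt_data %>% count(Subject, VariantID)
assert_that(all(runs$n == 5))

# Runtimes are strictly positive and the data set is complete.
assert_that(all(rt_data$Baseline > 0), all(rt_data$MySystem > 0))
assert_that(!any(is.na(rt_data)))
```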
5. Transform the data from wide to long format
The output data frame should have the following columns (the order does not matter):
Subject
VariantID
RunID
Approach
Runtime
In [5]:
rt_data.long <-
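One possible reshaping, sketched with tidyr's pivot_longer (assuming rt_data from step 2): the two runtime columns become a key column (Approach) and a value column (Runtime).

```r
suppressPackageStartupMessages(library(tidyverse))

# Turn the wide columns Baseline and MySystem into long format:
# each column name becomes a value of Approach, each cell a Runtime.
rt_data.long <- rt_data %>%
  pivot_longer(c(Baseline, MySystem),
               names_to = "Approach", values_to = "Runtime")
```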
6. Aggregate runtime data
Recall that each variant was analyzed 5 times (i.e., a deterministic program was executed 5 times on the same variant with identical inputs). Aggregate each of the 5 related runtime results – using mean or median. Provide a brief justification for your choice of mean vs. median (see Q2). (Your choice may be informed by data or domain knowledge.)
In [6]:
rt_data.agg <-
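A sketch of one possible aggregation, assuming rt_data.long from step 5 (median chosen here for robustness to measurement outliers; mean would be analogous):

```r
suppressPackageStartupMessages(library(tidyverse))

# Collapse the 5 runs of each (Subject, VariantID, Approach) into one value.
rt_data.agg <- rt_data.long %>%
  group_by(Subject, VariantID, Approach) %>%
  summarize(Runtime = median(Runtime), .groups = "drop")
```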
7. Validate aggregation
In [7]:
assert_that(nrow(rt_data.agg) == nrow(rt_data.long)/5)
(Optional: Add additional assertions for data validation.)
8. Plot the aggregated data, using color coding and faceting
In [8]:
ggplot(rt_data.agg) +
geom_density(aes(x=Runtime, color=Approach)) +
facet_grid(Subject~.) +
theme_bw() + theme(legend.position="top")
Read the syntax for facet_grid as: group the data by Subject and plot each subject on a separate row. More generally, facet_grid allows you to group your data and plot these groups individually by rows or columns (syntax: <rows> ~ <cols>). For example, the following four configurations group and render the same data in different ways:
facet_grid(Subject~.)
facet_grid(Subject~Approach)
facet_grid(.~Subject)
facet_grid(.~Subject+Approach)
A future lecture will discuss best practices for choosing a suitable visualization, depending on the underlying data and research questions.
9. Add a column for transformed data
It is reasonable to assume that the runtime data is log-normally distributed. Add a column RuntimeLog that simply takes the log of the Runtime column.
In [9]:
rt_data.agg <-
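A minimal sketch, assuming rt_data.agg from step 6:

```r
suppressPackageStartupMessages(library(tidyverse))

# Natural log of the aggregated runtimes; log-normally distributed data
# becomes (approximately) normal on this scale.
rt_data.agg <- rt_data.agg %>% mutate(RuntimeLog = log(Runtime))
```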
10. Plot transformed data
In [10]:
ggplot(rt_data.agg) +
geom_density(aes(x=RuntimeLog, color=Approach)) +
facet_grid(Subject~.) +
theme_bw() + theme(legend.position="top")
11. Test the difference(s) – Runtime, using the full data set
In [11]:
t <- t.test(Runtime~Approach, rt_data.agg)
d <- cohen.d(Runtime~Approach, rt_data.agg)
t.res <- tibble(subj="all", data="Linear", test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude)

u <- wilcox.test(Runtime~Approach, rt_data.agg)
a <- VD.A(Runtime~Approach, rt_data.agg)
u.res <- tibble(subj="all", data="Linear", test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude)

results <- bind_rows(t.res, u.res)
results
12. Test the difference(s) – Runtime vs. RuntimeLog, and per subject
Extend the code above (and the results data frame): add test results for all combinations of Subject x {Runtime, RuntimeLog} x {t.test, wilcox.test}. The final results data frame should provide 16 rows – the results for each subject as well as for all subjects (see Q3 and Q4).
Note: You are not graded on coding style or code efficiency. However, try to be as concise as possible.
In [12]:
# Add additional rows to the results data frame
# Test for completeness
assert_that(nrow(results) == 16)
# Print the final results data frame
results
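One concise pattern for building all 16 rows is a small helper that runs both tests on one data slice, plus a loop over the four subject slices and the two data scales. This is only a sketch, assuming rt_data.agg (with RuntimeLog) from the steps above:

```r
suppressPackageStartupMessages({
  library(tidyverse)
  library(effsize)
})

# Run both tests (T and U, with their effect sizes) on one data slice.
run_tests <- function(df, subj, label) {
  t <- t.test(Runtime ~ Approach, df)
  d <- cohen.d(Runtime ~ Approach, df)
  u <- wilcox.test(Runtime ~ Approach, df)
  a <- VD.A(Runtime ~ Approach, df)
  bind_rows(
    tibble(subj=subj, data=label, test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude),
    tibble(subj=subj, data=label, test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude))
}

# 4 slices (all + 3 subjects) x 2 data scales x 2 tests = 16 rows.
results <- bind_rows(lapply(c("all", unique(rt_data.agg$Subject)), function(s) {
  df <- if (s == "all") rt_data.agg else filter(rt_data.agg, Subject == s)
  bind_rows(run_tests(df, s, "Linear"),
            run_tests(mutate(df, Runtime = RuntimeLog), s, "Log"))
}))
```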
Part 2: General properties of the U test
Note: This part is independent of part 1 and not related to the runtime data set. In particular, independent samples in questions Q5 and Q6 refer to samples that you can make up (encoded manually or simulated using a common distribution) such that these samples satisfy the stated properties.
13. Code for questions Q5 and Q6
Supporting code for Q5
In [13]:
# Create two samples A and B
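A possible construction (a sketch): with two samples of five observations each, the exact two-sided p value of the U test is minimized when the samples do not overlap at all. It then equals 2/choose(10,5) = 2/252 ≈ 0.0079 and cannot get smaller.

```r
# Two completely separated samples of five observations each.
A <- c(1, 2, 3, 4, 5)
B <- c(6, 7, 8, 9, 10)
wilcox.test(A, B)  # exact two-sided p = 2/choose(10, 5) ≈ 0.0079
```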
Supporting code for Q6
In [14]:
# Create two samples A and B
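One construction (a sketch): anchor both medians at 0, but push A's lower half far below B's lower half and B's upper half far above A's upper half. The medians agree, yet the rank sums differ enough for significance (the many ties force wilcox.test to fall back to a normal approximation, hence the suppressed warning).

```r
# Both medians are 0, but the bulk of A ranks below the bulk of B.
A <- c(rep(-10, 10), 0, rep(1, 10))
B <- c(rep(-1, 10), 0, rep(10, 10))
median(A) == median(B)               # TRUE
suppressWarnings(wilcox.test(A, B))  # p < 0.05 despite equal medians
```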
Questions (5 pts)
Q1 Briefly justify your choice of data validation assertions: what informed your choices? (0.5 pts)
Q2 Briefly justify your choice for aggregating the runtime data. (0.5 pts)
Q3 How did the data transformation of the aggregated Runtime values as well as the slicing by Subject affect the outcomes of the parametric and non-parametric tests (T vs. U)? Briefly explain your observations (considering differences in p values and effect sizes). (1 pt)
Q4 Given your understanding of the data-generation process and your observations about the data, indicate and justify which data analysis is preferable. (Consider possible decisions such as all subjects vs. per subject, transformed vs. non-transformed data, and parametric vs. non-parametric statistics.) (1 pt)
Q5 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is truly minimal when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot. (0.5 pts)
Q6 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice). (1 pt)
Q7 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the median? (0.5 pts)
Q8 (Optional) Additional validation efforts. (up to 0.5 pts)