suppressPackageStartupMessages({
library(tidyverse)
library(assertthat)
library(effsize)
})
In-class exercise: Statistical Significance and Power – Solutions
Part 1: Parametric vs. non-parametric statistics
Instructions
1. Install (if needed) and load the following packages
In [1]:
2. Load the runtime.csv
dataset from the course website
In [2]:
rt_data <- read_csv("https://homes.cs.washington.edu/~rjust/courses/CSEP590/in_class/04_stats/data/runtime.csv", show_col_types=F)
For faster (local) exploration you may download the dataset and load it from a local file. However, make sure that your final submission reads the data from the given URL.
3. Inspect the data set
In [3]:
head(rt_data)
This dataset provides benchmark results (runtime data) for a new program analysis approach (MySystem), compared to a baseline approach (Baseline). Specifically, the columns provide the following information:
- Subject: One of three benchmark programs (tax, tictactoe, triangle).
- VariantID: ID of a program variant. Each subject program has a different number of variants. For a given subject program, all variants are enumerated, starting with an ID of 1. The expected numbers of variants are as follows:
  - tax: 99
  - tictactoe: 268
  - triangle: 122
- RunID: Each program variant was analyzed 5 times to account for variability in runtime measurements.
- Baseline: The runtime of the baseline system.
- MySystem: The runtime of the new system.
Additional data expectations: runtime is strictly positive and the data set is complete.
4. Validate the data set
Given the summary above, test for 3 expected properties of the data set, not counting the example assertion on number of subject programs (see Q1).
(Optional: Thoroughly validate the data set beyond 3 expected properties.)
Note: If your validation reveals any failing assertions (1) ask the course staff whether these are expected and (2) comment out the assertion and move on.
In [4]:
# Count unique subject names
nSubj <- length(unique(rt_data$Subject))
assert_that(3 == nSubj)
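Three further assertions that follow from the data description above could, for example, check the expected variant counts, the five runs per variant, and the positivity/completeness expectations. This is only a sketch, assuming the column names from step 3:

```r
suppressPackageStartupMessages({
  library(tidyverse)
  library(assertthat)
})

# Expected variant counts per subject (from the data description).
expected <- c(tax=99, tictactoe=268, triangle=122)
counts <- rt_data %>% group_by(Subject) %>% summarize(n = n_distinct(VariantID))
assert_that(all(counts$n == expected[counts$Subject]))

# Each (Subject, VariantID) pair was measured exactly 5 times.
runs <- rt_data %>% count(Subject, VariantID)
assert_that(all(runs$n == 5))

# Runtimes are strictly positive and the data set is complete.
assert_that(all(rt_data$Baseline > 0), all(rt_data$MySystem > 0))
assert_that(!any(is.na(rt_data)))
```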
5. Transform the data from wide to long format
The output data frame should have the following columns (the order does not matter):
Subject
VariantID
RunID
Approach
Runtime
In [5]:
rt_data.long <-
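One possible reshaping, sketched with tidyr's pivot_longer (assuming rt_data from step 2): the two runtime columns become a key column (Approach) and a value column (Runtime).

```r
suppressPackageStartupMessages(library(tidyverse))

# Turn the wide columns Baseline and MySystem into long format:
# each column name becomes a value of Approach, each cell a Runtime.
rt_data.long <- rt_data %>%
  pivot_longer(c(Baseline, MySystem),
               names_to = "Approach", values_to = "Runtime")
```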
6. Aggregate runtime data
Recall that each variant was analyzed 5 times (i.e., a deterministic program was executed 5 times on the same variant with identical inputs). Aggregate each of the 5 related runtime results – using mean or median. Provide a brief justification for your choice of mean vs. median (see Q2). (Your choice may be informed by data or domain knowledge.)
In [6]:
rt_data.agg <-
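A sketch of one possible aggregation, assuming rt_data.long from step 5 (median chosen here for robustness to measurement outliers; mean would be analogous):

```r
suppressPackageStartupMessages(library(tidyverse))

# Collapse the 5 runs of each (Subject, VariantID, Approach) into one value.
rt_data.agg <- rt_data.long %>%
  group_by(Subject, VariantID, Approach) %>%
  summarize(Runtime = median(Runtime), .groups = "drop")
```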
7. Validate aggregation
In [7]:
assert_that(nrow(rt_data.agg) == nrow(rt_data.long)/5)
(Optional: Add additional assertions for data validation.)
8. Plot the aggregated data, using color coding and faceting
In [8]:
ggplot(rt_data.agg) +
geom_density(aes(x=Runtime, color=Approach)) +
facet_grid(Subject~.) +
theme_bw() + theme(legend.position="top")
Read the syntax for facet_grid as: group the data by Subject and plot each subject on a separate row. More generally, facet_grid allows you to group your data and plot these groups individually by rows or columns (syntax: <rows> ~ <cols>). For example, the following four configurations group and render the same data in different ways:
facet_grid(Subject~.)
facet_grid(Subject~Approach)
facet_grid(.~Subject)
facet_grid(.~Subject+Approach)
A future lecture will discuss best practices for choosing a suitable visualization, depending on the underlying data and research questions.
9. Add a column for transformed data
It is reasonable to assume that the runtime data is log-normally distributed. Add a column RuntimeLog that simply takes the log of the Runtime column.
In [9]:
rt_data.agg <-
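A minimal sketch, assuming rt_data.agg from step 6:

```r
suppressPackageStartupMessages(library(tidyverse))

# Natural log of the aggregated runtimes; log-normally distributed data
# becomes (approximately) normal on this scale.
rt_data.agg <- rt_data.agg %>% mutate(RuntimeLog = log(Runtime))
```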
10. Plot transformed data
In [10]:
ggplot(rt_data.agg) +
geom_density(aes(x=RuntimeLog, color=Approach)) +
facet_grid(Subject~.) +
theme_bw() + theme(legend.position="top")
11. Test the difference(s) – Runtime, using the full data set
In [11]:
t <- t.test(Runtime~Approach, rt_data.agg)
d <- cohen.d(Runtime~Approach, rt_data.agg)
t.res <- tibble(subj="all", data="Linear", test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude)

u <- wilcox.test(Runtime~Approach, rt_data.agg)
a <- VD.A(Runtime~Approach, rt_data.agg)
u.res <- tibble(subj="all", data="Linear", test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude)

results <- bind_rows(t.res, u.res)
results
12. Test the difference(s) – Runtime vs. RuntimeLog, and per subject
Extend the code above (and the results data frame): add test results for all combinations of Subject x {Runtime, RuntimeLog} x {t.test, wilcox.test}. The final results data frame should provide 16 rows – the results for each subject as well as for all subjects (see Q3 and Q4).
Note: You are not graded on coding style or code efficiency. However, try to be as concise as possible.
In [12]:
# Add additional rows to the results data frame
# Test for completeness
assert_that(nrow(results) == 16)
# Print the final results data frame
results
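One concise pattern for building all 16 rows is a small helper that runs both tests on one data slice, plus a loop over the four subject slices and the two data scales. This is only a sketch, assuming rt_data.agg (with RuntimeLog) from the steps above:

```r
suppressPackageStartupMessages({
  library(tidyverse)
  library(effsize)
})

# Run both tests (T and U, with their effect sizes) on one data slice.
run_tests <- function(df, subj, label) {
  t <- t.test(Runtime ~ Approach, df)
  d <- cohen.d(Runtime ~ Approach, df)
  u <- wilcox.test(Runtime ~ Approach, df)
  a <- VD.A(Runtime ~ Approach, df)
  bind_rows(
    tibble(subj=subj, data=label, test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude),
    tibble(subj=subj, data=label, test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude))
}

# 4 slices (all + 3 subjects) x 2 data scales x 2 tests = 16 rows.
results <- bind_rows(lapply(c("all", unique(rt_data.agg$Subject)), function(s) {
  df <- if (s == "all") rt_data.agg else filter(rt_data.agg, Subject == s)
  bind_rows(run_tests(df, s, "Linear"),
            run_tests(mutate(df, Runtime = RuntimeLog), s, "Log"))
}))
```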
Part 2: General properties of the U test
Note: This part is independent of part 1 and not related to the runtime data set. In particular, independent samples in questions Q5 and Q6 refer to samples that you can make up (encoded manually or simulated using a common distribution) such that these samples satisfy the stated properties.
13. Code for questions Q5 and Q6
Supporting code for Q5
In [13]:
# Create two samples A and B
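A possible construction (a sketch): with two samples of five observations each, the exact two-sided p value of the U test is minimized when the samples do not overlap at all. It then equals 2/choose(10,5) = 2/252 ≈ 0.0079 and cannot get smaller.

```r
# Two completely separated samples of five observations each.
A <- c(1, 2, 3, 4, 5)
B <- c(6, 7, 8, 9, 10)
wilcox.test(A, B)  # exact two-sided p = 2/choose(10, 5) ≈ 0.0079
```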
Supporting code for Q6
In [14]:
# Create two samples A and B
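One construction (a sketch): anchor both medians at 0, but push A's lower half far below B's lower half and B's upper half far above A's upper half. The medians agree, yet the rank sums differ enough for significance (the many ties force wilcox.test to fall back to a normal approximation, hence the suppressed warning).

```r
# Both medians are 0, but the bulk of A ranks below the bulk of B.
A <- c(rep(-10, 10), 0, rep(1, 10))
B <- c(rep(-1, 10), 0, rep(10, 10))
median(A) == median(B)               # TRUE
suppressWarnings(wilcox.test(A, B))  # p < 0.05 despite equal medians
```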
Questions (5 pts)
Q1 Briefly justify your choice of data validation assertions: what informed your choices? (0.5 pts)
Q2 Briefly justify your choice for aggregating the runtime data. (0.5 pts)
Q3 How did the data transformation of the aggregated Runtime values as well as the slicing by Subject affect the outcomes of the parametric and non-parametric tests (T vs. U)? Briefly explain your observations (considering differences in p values and effect sizes). (1 pt)
Q4 Given your understanding of the data-generation process and your observations about the data, indicate and justify which data analysis is preferable. (Consider possible decisions such as all subjects vs. per subject, transformed vs. non-transformed data, and parametric vs. non-parametric statistics.) (1 pt)
Q5 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is truly minimal when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot. (0.5 pts)
Q6 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice). (1 pt)
Q7 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the median? (0.5 pts)
Q8 (Optional) Additional validation efforts. (up to 0.5 pts)