--- title: "In-class exercise Statistical Significance and Power: Solutions" author: "Name1, Name2" format: html jupyter: ir engine: knitr # Set to true to evaluate the code cells when rendering the notebook eval: false # Embed notebook view, together with a download link of the Rmd notebook notebook-view: - notebook: in_class4.Rmd title: "Statistical Significance and Power: Rmd notebook" --- ## Part 1: Parametric vs. non-parametric statistics ### Instructions #### 1. Install (if needed) and load the following packages ```{r setup} suppressPackageStartupMessages({ library(tidyverse) library(assertthat) library(effsize) }) ``` #### 2. Load the `runtime.csv` dataset from the course website ```{r dataset} rt_data <- read_csv("https://homes.cs.washington.edu/~rjust/courses/CSEP590/in_class/04_stats/data/runtime.csv", show_col_types=F) ``` For faster (local) exploration you may download the dataset and load it from a local file. *However, make sure that your final submission reads the data from the given URL.* #### 3. Inspect the data set ```{r inspect} head(rt_data) ``` This dataset provides benchmark results (runtime data) for a new program analysis approach (`MySystem`), compared to a baseline approach (`Baseline`). Specifically, the columns provide the following information: * `Subject`: One of three benchmark programs (*tax*, *tictactoe*, *triangle*). * `VariantID`: ID of a program variant. *Each subject program has a different number of variants*. For a given subject program, all variants are enumerated, starting with an ID of 1. The expected number of variants are as follows: - *tax*: 99 - *tictactoe*: 268 - *triangle*: 122 * `RunID`: *Each* program *variant* was *analyzed 5 times* to account for variability in runtime measurements. * `Baseline`: The *runtime* of the *baseline system*. * `MySystem`: The *runtime* of the *new system*. Additional data expectations: runtime is strictly positive and the data set is complete. #### 4. Validate the data set Given the summary above, test for 3 expected properties of the data set, not counting the example assertion on number of subject programs (see Q1). (Optional: Thoroughly validate the data set beyond 3 expected properties.) *Note: If your validation reveals any failing assertions (1) ask the course staff whether these are expected and (2) comment out the assertion and move on.* ```{r validate-data} # Count unique subject names nSubj <- length(unique(rt_data$Subject)) assert_that(3 == nSubj) ``` #### 5. Transform the data from wide to long format The output data frame should have the following columns (the order does not matter): * `Subject` * `VariantID` * `RunID` * `Approach` * `Runtime` ```{r tidy-data} rt_data.long <- ``` #### 6. Aggregate runtime data Recall that each variant was analyzed 5 times (i.e., a deterministic program was executed 5 times on the same variant with identical inputs). Aggregate each of the 5 related runtime results -- using mean or median. *Provide a brief justification for your choice of mean vs. median (see Q2).* (Your choice may be informed by data or domain knowledge.) ```{r agg-data} rt_data.agg <- ``` #### 7. Validate aggregation ```{r validate-agg} assert_that(nrow(rt_data.agg) == nrow(rt_data.long)/5) ``` (Optional: Add additional assertions for data validation.) #### 8. Plot the aggregated data, using color coding and a faceting ```{r plot-rt} ggplot(rt_data.agg) + geom_density(aes(x=Runtime, color=Approach)) + facet_grid(Subject~.) 
#### 9. Add a column for transformed data

It is reasonable to assume that the runtime data is log-normally distributed. Add a column `RuntimeLog` that simply takes the `log` of the `Runtime` column.

```{r log}
rt_data.agg <-
```

#### 10. Plot transformed data

```{r plot-rt-log}
ggplot(rt_data.agg) +
  geom_density(aes(x=RuntimeLog, color=Approach)) +
  facet_grid(Subject~.) +
  theme_bw() +
  theme(legend.position="top")
```

#### 11. Test the difference(s) -- `Runtime` using the full data set

```{r test-linear-all}
t <- t.test(Runtime~Approach, rt_data.agg)
d <- cohen.d(Runtime~Approach, rt_data.agg)
t.res <- tibble(subj="all", data="Linear", test="T", p=t$p.value, eff=d$estimate, eff_qual=d$magnitude)

u <- wilcox.test(Runtime~Approach, rt_data.agg)
a <- VD.A(Runtime~Approach, rt_data.agg)
u.res <- tibble(subj="all", data="Linear", test="U", p=u$p.value, eff=a$estimate, eff_qual=a$magnitude)

results <- bind_rows(t.res, u.res)
results
```

#### 12. Test the difference(s) -- `Runtime` vs. `RuntimeLog` and per subject

Extend the code above (and the results data frame): add test results for all combinations of `Subject` x `{Runtime, RuntimeLog}` x `{t.test, wilcox.test}`. The final results data frame should provide 16 rows -- the results for *each subject as well as for all subjects (see Q3 and Q4)*. (A possible sketch for this step appears at the end of this notebook.)

*Note: You are not graded on coding style or code efficiency. However, try to be as concise as possible.*

```{r test-groups}
# Add additional rows to the results data frame

# Test for completeness
assert_that(nrow(results) == 16)

# Print the final results data frame
results
```

## Part 2: General properties of the U test

*Note: This part is independent of Part 1 and not related to the runtime data set.* In particular, *independent samples* in questions Q5 and Q6 refers to samples that you can make up (encoded manually or simulated using a common distribution) such that these samples satisfy the stated properties.

#### 13. Code for questions Q5 and Q6

Supporting code for Q5

```{r u-test-q5}
# Create two samples A and B
```

Supporting code for Q6

```{r u-test-q6}
# Create two samples A and B
```
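The two supporting-code cells above are intentionally left for your own samples. If a starting point helps, the sketch below shows one possible construction for each question; the sample names (`A5`, `B5`, `A6`, `B6`) and the concrete values are illustrative assumptions, not the only valid choice:

```r
# Q5 sketch: two completely separated samples of five observations each.
# With no overlap and no ties, the exact two-sided p value is 2/choose(10, 5),
# i.e., about 0.008 -- the smallest value attainable for two samples of size five.
A5 <- 1:5
B5 <- 6:10
wilcox.test(A5, B5)

# Q6 sketch: equal sample medians, but B6 is stochastically larger than A6,
# so the U test can still reject its null hypothesis.
A6 <- c(seq(1.1, 1.9, by = 0.1),       # 9 values far below 10
        9.9, 10.1,                     # middle pair: median(A6) = 10
        seq(10.11, 10.19, by = 0.01))  # 9 values just above 10
B6 <- c(seq(8.9, 9.8, by = 0.1),       # 10 values just below 10
        10.2,                          # together with 9.8: median(B6) = 10
        20:28)                         # 9 values far above 10
c(median(A6), median(B6))              # both 10
wilcox.test(A6, B6)                    # p should be roughly 0.015, i.e., < 0.05
```

The sketch only covers the sample construction; Q5 and Q6 additionally ask you to state the null hypothesis (Q5), justify your construction (Q6), and visualize the samples.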
### Questions *(5 pts)*

* Q1 Briefly justify your choice of data validation assertions (what informed your choices?). *(0.5 pts)*
* Q2 Briefly justify your choice for aggregating the runtime data. *(0.5 pts)*
* Q3 How did the data transformation of the aggregated `Runtime` values as well as the slicing by `Subject` affect the outcomes of the parametric and non-parametric tests (T vs. U)? Briefly explain your observations (considering differences in p values and effect sizes). *(1 pt)*
* Q4 Given your understanding of the data-generation process and your observations about the data, indicate and justify which data analysis is preferable. (Consider possible decisions such as all subjects vs. per subject, transformed vs. non-transformed data, and parametric vs. non-parametric statistics.) *(1 pt)*
* Q5 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is as small as possible when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot. *(0.5 pts)*
* Q6 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice.) *(1 pt)*
* Q7 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the median? *(0.5 pts)*
* Q8 (Optional) Additional validation efforts. *(up to 0.5 pts)*
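Finally, as a supplement to step 12 in Part 1 (referenced there), the sketch below shows one concise way to assemble the full 16-row `results` data frame. The helper name `run_tests` and the label `"Log"` for the transformed data are illustrative assumptions; the tests, effect sizes, and column names mirror step 11, and `rt_data.agg` is assumed to already contain the `RuntimeLog` column from step 9:

```r
# Sketch only: rebuilds the complete results table, including the two rows
# already computed in step 11, so the completeness assertion still holds.
run_tests <- function(df, subj, col, label) {
  f <- as.formula(paste(col, "~ Approach"))
  t <- t.test(f, df)
  d <- cohen.d(f, df)
  u <- wilcox.test(f, df)
  a <- VD.A(f, df)
  bind_rows(
    tibble(subj = subj, data = label, test = "T",
           p = t$p.value, eff = d$estimate, eff_qual = d$magnitude),
    tibble(subj = subj, data = label, test = "U",
           p = u$p.value, eff = a$estimate, eff_qual = a$magnitude)
  )
}

# 4 subject slices ("all" plus the three subjects) x 2 data columns x 2 tests = 16 rows
results <- bind_rows(lapply(c("all", unique(rt_data.agg$Subject)), function(s) {
  df <- if (s == "all") rt_data.agg else filter(rt_data.agg, Subject == s)
  bind_rows(run_tests(df, s, "Runtime",    "Linear"),
            run_tests(df, s, "RuntimeLog", "Log"))
}))
```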