In-class exercise: CLT and NHST

Overview

The high-level goals of this exercise are: (1) deepen your understanding of the CLT, (2) gain experience with simulations and working with common distributions in R, and (3) deepen your understanding of parametric and non-parametric statistics in the NHST framework.

Team up in groups of size 2-3, and self-assign to a group (In-class-CLT-NHST) on Canvas. (Only one group member needs to submit; if you are in a Canvas group of size 1, you can still submit.)

This collaborative exercise has three parts. As you work through them:

  • Respond to the “Questions” at the end of each part and submit your responses on Canvas.

  • If you get stuck on a question or anything is unclear while completing the exercise, ask the course staff for clarification.

Code cell shortcuts:

  • cmd + return: execute current line (step-by-step execution)

  • shift + return: execute current cell

Fallback live notebook

If your browser does not support this live notebook, you can use the following alternative notebook.

Required packages

Note

All required packages are automatically loaded.

library(tidyverse)

Part 1: CLT is more than an airport in NC – the Central Limit Theorem

Instructions

  1. Set the random seed to 1.

(It is good practice to do so for reproducibility. The actual value does not and should not matter. It is set to “1” because the exercise refers to specific samples below.)

  2. Create a population:
    • 1M normally distributed values.
    • Set mean and standard deviation such that about 96% of all values are between 80 and 120.
    • Make pop a tibble with 1M rows and a single column Value.

Test whether the created population meets the stated criteria (see Q1). It is sufficient to ballpark the output of a simple calculation, but the output should appear in your solution.

  3. From this population, draw S=10 samples – each with N=10 observations (i.e., 10 values from the population, sampled without replacement).

    Hint: Use the sample function.

  4. Make samples a tidy tibble (long format) with two columns: (1) Sample and (2) Value. Make the Sample column a factor (that represents the sample id).

  5. Plot the population (black line) and all samples (color-coded) using a density plot.
  6. Print the mean and standard deviation of the population and the mean and standard deviation of each of the samples. Compare the values between the population and the samples (plot and printed values) – note your observations (see Q3).
  7. Create a tibble with two columns: (1) Sample and (2) Mean that contains the mean of each sample (one row per sample). What is the name of this new distribution of sample means (see Q2)?
  8. Print the mean and standard deviation of the population and the mean and standard deviation of the distribution of sample means. Note your observations (see Q3).
  9. Plot the population (black line) and all sample means (red line) using a density plot.
  10. Increase S to 100 and N from 10 to 30, and then to 100. For each value of N, repeat steps 8 and 9 and note your observations (see Q4). (Keep S=100 and N=100 going forward.)

  11. Print the mean and the standard error (sd/sqrt(N)) of the first sample and compare the outputs with those of step 8. Note your observations (see Q3).
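The first few instructions can be sketched roughly as follows. This is one possible starting point, not the reference solution. The population mean of 100, the use of qnorm(0.98) to pick the standard deviation, and the helper names pop_mean, pop_sd, share_in_range, sample_means, first_sample, se_first, and p are assumptions made for illustration; only pop, samples, Sample, Value, Mean, S, and N come from the exercise text.

```r
library(tidyverse)

set.seed(1)  # the value itself is arbitrary; 1 matches the exercise text

# Assumption: the population is centered at 100, and "about 96%" is read as
# the central 96% of a normal, i.e. mean +/- qnorm(0.98) (about 2.05) sd.
pop_mean <- 100
pop_sd   <- 20 / qnorm(0.98)
pop <- tibble(Value = rnorm(1e6, mean = pop_mean, sd = pop_sd))

# Ballpark check: share of values between 80 and 120
share_in_range <- mean(pop$Value >= 80 & pop$Value <= 120)
print(share_in_range)

S <- 10  # number of samples
N <- 10  # observations per sample

# replicate() returns an N x S matrix (one column per sample); as.vector()
# flattens it column by column, which lines up with rep(..., each = N).
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(replicate(S, sample(pop$Value, N)))
)

# One row per sample: the observed sample means
sample_means <- samples %>%
  group_by(Sample) %>%
  summarize(Mean = mean(Value))

# Standard error estimated from the first sample only
first_sample <- samples %>% filter(Sample == "1") %>% pull(Value)
se_first <- sd(first_sample) / sqrt(N)

# Density plot: population in black, samples color-coded
p <- ggplot() +
  geom_density(data = samples, aes(x = Value, color = Sample)) +
  geom_density(data = pop, aes(x = Value), color = "black")
```

Calling print(p) renders the plot. Changing S and N at the top and re-running the block is one way to explore the larger sample sizes the instructions ask for.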

Questions

  • Q1 Briefly explain your test in step 2.

  • Q2 In your own words, describe the terms population, sample, and sampling distribution.

  • Q3 What did you observe when comparing the mean(s) and standard deviation(s) between the three distributions (population, first sample, sampling distribution)? Briefly explain the relationship between these three distributions (w.r.t. mean, standard deviation, and standard error).

  • Q4 What did you observe when increasing N? What are the implications?

  • Q5 (Optional) Repeat the exercise by changing the distribution of the population.

  • Q6 (Optional) Create an animated visualization that shows how the sample size N affects the sample distributions and the sampling distribution.
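For the optional Q5, a minimal base-R sketch of one way to start is shown below. The exponential population, the smaller population size of 100,000 (chosen only to keep repeated sampling fast), and the value S = 1000 are assumptions for illustration, not part of the exercise.

```r
set.seed(1)

# Assumption: an exponential population stands in for "a different distribution".
# With rate = 1 it has mean 1 and standard deviation 1 and is strongly right-skewed.
pop <- rexp(1e5, rate = 1)

S <- 1000  # number of samples per sample size (hypothetical choice)

for (N in c(10, 30, 100)) {
  # Mean of each of the S samples, each drawn without replacement
  means <- replicate(S, mean(sample(pop, N)))
  cat(sprintf(
    "N = %3d  mean of means = %.3f  sd of means = %.3f  sd(pop)/sqrt(N) = %.3f\n",
    N, mean(means), sd(means), sd(pop) / sqrt(N)
  ))
}
```

Passing means to hist() or plot(density(means)) for each N shows how the shape of the sampling distribution changes even though the population itself is skewed.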

Part 2: The non-parametric U test

Note: This part is independent of part 1. In particular, independent samples in questions Q7 and Q8 refer to samples that you can make up (encoded manually or simulated using a common distribution) such that these samples satisfy the stated criteria.

Code for questions Q7 and Q8

Supporting code for Q7

Supporting code for Q8

Questions

  • Q7 Consider the non-parametric U test: Create two independent samples A and B such that (1) each sample has five observations and (2) the p value is truly minimal when comparing A and B. State the null hypothesis for this U test and visualize the two samples with a point plot.

  • Q8 Consider the non-parametric U test: Create two independent samples A and B such that the p value is significant (p<0.05) but the medians are the same. Describe your approach (with a justification) to creating the two samples and visualize the two samples. (Depending on the samples, a point plot, histogram, or density plot may be an appropriate choice.)

  • Q9 Under what assumption(s) can the U test of independent samples be interpreted as a significance test for the difference in medians?
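As a hedged illustration of how the U test can be run in R, the sketch below uses wilcox.test from the stats package, which implements the Wilcoxon rank-sum (Mann-Whitney U) test. The sample values are hypothetical and chosen only to show the call and how the exact p value arises from counting rank arrangements; they are not presented as the intended answer.

```r
# Hypothetical samples with no overlap and no ties
A <- c(1, 2, 3, 4, 5)
B <- c(6, 7, 8, 9, 10)

# Two-sided test; for small samples without ties R computes an exact p value
res <- wilcox.test(A, B)
print(res$p.value)

# Under the null hypothesis every split of the 10 ranks into two groups of 5
# is equally likely. Only 2 of the choose(10, 5) = 252 splits are as extreme
# as complete separation (A holds the 5 lowest or the 5 highest ranks).
p_by_counting <- 2 / choose(10, 5)
print(p_by_counting)
```

Because the test only uses ranks, any other pair of samples with the same rank pattern gives the same p value.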

Part 3: Testing your understanding

Questions

  • Q10 Which of the following seven statements are wrong? Briefly explain why.
    • The CLT only applies to distributions with finite variance.

    • The CLT is only applicable if the sample size is at least 30.

    • For a large enough sample size, the sampling distribution of the sample means tends to be normally distributed.

    • For a large enough sample size, the sampling distribution of any sample statistic tends to be normally distributed.

    • For a large enough sample size, the distribution of a sample from a population with finite variance tends to be normally distributed.

    • The mean of the sample means converges to the mean of the population.

    • The standard deviation of the sample means is approximately equal to the standard deviation of the population.