
CLT is more than an airport in NC – the Central Limit Theorem

Instructions

  1. Install (if needed) and load the following packages:
In [1]:
library(tidyverse)
  2. Set the random seed to 1.
In [3]:
set.seed(1)
  3. Create a population of 1M normally distributed values. Set the mean and standard deviation such that about 96% of all values are between 80 and 120. Test whether the created population meets the stated criteria (see Q1). It is sufficient to ballpark the output of a simple calculation, but the output should appear in your solution.
In [5]:
pop <- 
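One possible way to approach this step is sketched below; it is not the reference solution. It assumes the population is centred on 100, the midpoint of the interval. If 96% of the values lie in the middle, 2% lie in each tail, so 120 sits at the 98th percentile and the standard deviation follows from `qnorm(0.98)` (about 9.74). The names `pop_mean`, `pop_sd`, and `share_inside` are illustrative. The column is called `Value` because the plotting code in this notebook expects `pop` to have a `Value` column.

```r
library(tidyverse)

set.seed(1)

# Assumption: centre the population on the midpoint of [80, 120].
pop_mean <- 100

# 96% in the middle leaves 2% in each tail, so 120 is the 98th percentile:
# 120 = pop_mean + qnorm(0.98) * pop_sd
pop_sd <- (120 - pop_mean) / qnorm(0.98)

# One million draws, stored in a tibble with a single column named Value.
pop <- tibble(Value = rnorm(1e6, mean = pop_mean, sd = pop_sd))

# Simple check: the share of values inside [80, 120] should be close to 0.96.
share_inside <- mean(pop$Value >= 80 & pop$Value <= 120)
print(pop_sd)
print(share_inside)
```

A rounder choice such as `sd = 10` places 80 and 120 exactly two standard deviations from the mean, which covers roughly 95.4% of the values; whether that counts as "about 96%" is a judgment call.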
  4. From this population, draw 10 samples – each with 10 values from the population (sampled without replacement).

  5. Create a tibble in long format with two columns: (1) Sample and (2) Value. Make Sample a factor (it represents the sample ID).

In [7]:
# Number of samples
S <- 10
# Sample size for each sample
N <- 10
# Construct tibble for all samples
samples <-
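The sampling and the long-format tibble could be built as in the following sketch. It uses a stand-in population with an assumed mean of 100 and standard deviation of 10 so that the block runs on its own. `sample()` draws without replacement by default, and `replicate()` returns an N-by-S matrix whose columns are the individual samples, so flattening it column by column lines up with `rep(..., each = N)`.

```r
library(tidyverse)

set.seed(1)

# Stand-in population so the block is self-contained (assumed mean 100, sd 10).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 10))

# Number of samples
S <- 10
# Sample size for each sample
N <- 10

# replicate() gives an N x S matrix: one column per sample.
draws <- replicate(S, sample(pop$Value, N))

# Long format: one row per drawn value, Sample is the sample id as a factor.
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(draws)
)

print(samples)
```

Each sample is drawn without replacement, but separate samples are drawn independently, so in principle the same population value could appear in two different samples.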
  6. Plot the population (black line) and all samples (color-coded) using a density plot.
In [9]:
ggplot() + geom_density(data=samples, aes(x=Value, color=Sample)) +
           geom_density(data=pop, aes(x=Value), color="black") +
           theme_bw() + theme(legend.position="none")
  7. Print the mean and standard deviation of the population and the means and standard deviations of each of the samples. Compare the values between the population and the samples (plot and printed values) – note your observations (see Q3).
In [11]:
# Mean and sd of population
mean(...)
sd(...)
# Mean and sd of each sample
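A minimal sketch of the per-sample summary, assuming `pop` and `samples` have the shapes described in the earlier steps (a `Value` column in `pop`; `Sample` and `Value` columns in `samples`). The stand-in data and the name `sample.stats` are illustrative only.

```r
library(tidyverse)

set.seed(1)

# Stand-in data so the block runs on its own (assumed mean 100, sd 10).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 10))
S <- 10
N <- 10
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(replicate(S, sample(pop$Value, N)))
)

# Mean and sd of population
print(mean(pop$Value))
print(sd(pop$Value))

# Mean and sd of each sample: group by sample id, one summary row per sample
sample.stats <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value), SD = sd(Value))
print(sample.stats)
```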
  8. Create a tibble with two columns: (1) Sample and (2) Mean that contains the mean of each sample (one row per sample). What is the name of this new distribution of sample means (see Q2)?
In [13]:
samples.means <- 
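The tibble of sample means can be derived from the long-format `samples` tibble with a grouped summary. The sketch below rebuilds stand-in data (assumed mean 100, sd 10) so it runs on its own; only the last statement is specific to this step.

```r
library(tidyverse)

set.seed(1)

# Stand-in data so the block runs on its own (assumed mean 100, sd 10).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 10))
S <- 10
N <- 10
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(replicate(S, sample(pop$Value, N)))
)

# One row per sample: the sample id and the mean of its N values.
samples.means <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

print(samples.means)
```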
  9. Print the mean and standard deviation of the population and the mean and standard deviation of the distribution of sample means. Note your observations (see Q3).

  10. Plot the population (black line) and all sample means (red line) using a density plot.

In [15]:
ggplot() + geom_density(data=samples.means, aes(x=Mean), color="red") +
           geom_density(data=pop, aes(x=Value), color="black") +
           theme_bw() + theme(legend.position="none")
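The printed comparison between the population and the distribution of sample means could look like the sketch below. It also prints the population standard deviation divided by `sqrt(N)`, which is the value theory predicts for the standard deviation of the sample means. The stand-in data (assumed mean 100, sd 10) and the name `expected_sd` are illustrative. With only S = 10 sample means, the observed spread is a rough estimate and will not match the theoretical value closely.

```r
library(tidyverse)

set.seed(1)

# Stand-in data so the block runs on its own (assumed mean 100, sd 10).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 10))
S <- 10
N <- 10
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(replicate(S, sample(pop$Value, N)))
)
samples.means <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

# Mean and sd of the population
print(c(mean = mean(pop$Value), sd = sd(pop$Value)))

# Mean and sd of the distribution of sample means
print(c(mean = mean(samples.means$Mean), sd = sd(samples.means$Mean)))

# Theoretical sd of the sample means: population sd divided by sqrt(N)
expected_sd <- sd(pop$Value) / sqrt(N)
print(expected_sd)
```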
  11. Increase S to 100 and N from 10 to 30, and then to 100. For each value of N, repeat steps 9 and 10 and note your observations (see Q4). (Keep S=100 and N=100 going forward.)

  12. Print the mean and the standard error (sd/sqrt(N)) of the first sample and compare the outputs with those of step 9. Note your observations (see Q3).
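The standard error of a single sample can be computed as in the following sketch, here with S = 100 and N = 100 and a stand-in population (assumed mean 100, sd 10). The names `first`, `first_mean`, and `first_se` are illustrative. The standard error estimates, from one sample alone, the standard deviation of the distribution of sample means, so the two printed values should be of similar size.

```r
library(tidyverse)

set.seed(1)

# Stand-in data so the block runs on its own (assumed mean 100, sd 10).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 10))
S <- 100
N <- 100
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = as.vector(replicate(S, sample(pop$Value, N)))
)
samples.means <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

# Values of the first sample (factor levels are compared as strings)
first <- samples %>%
  filter(Sample == "1") %>%
  pull(Value)

# Mean and standard error (sd / sqrt(N)) of the first sample
first_mean <- mean(first)
first_se <- sd(first) / sqrt(N)
print(c(mean = first_mean, se = first_se))

# For comparison: mean and sd of the distribution of sample means
print(c(mean = mean(samples.means$Mean), sd = sd(samples.means$Mean)))
```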

Questions

  • Q1 Briefly explain your test in step 3.

  • Q2 In your own words, describe the terms population, sample, and sampling distribution.

  • Q3 What did you observe when comparing the mean(s) and standard deviation(s) between the three distributions (population, first sample, sampling distribution)? Briefly explain the relationship between these three distributions (w.r.t. mean, standard deviation, and standard error).

  • Q4 What did you observe when increasing N? What are the implications?

  • Q5 (Optional) Repeat the exercise by changing the distribution of the population.

  • Q6 (Optional) Create an animated visualization that shows how the sample size N affects the sample distributions and the sampling distribution.