In-class exercise Statistical Modeling: Solutions
CLT is more than an airport in NC – the Central Limit Theorem
Instructions
- Install (if needed) and load the following packages:
In [1]:
library(tidyverse)
- Set the random seed to 1.
In [2]:
set.seed(1)
- Create a population: 1M normally distributed values. Set mean and standard deviation such that about 96% of all values are between 80 and 120. Test whether the created population meets the stated criteria (see Q1). It is sufficient to ballpark the output of a simple calculation, but the output should appear in your solution.
In [3]:
pop <- ...
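One possible way to fill in this cell is sketched below. It assumes the population is centered at 100 (the midpoint of 80 and 120) and derives the standard deviation from the 98th percentile of the standard normal; both choices, and the helper names pop_mean, pop_sd, and share_inside, are assumptions rather than part of the original solution. Since a ballpark is sufficient, a standard deviation of 10 would likely pass as well.

```r
library(tidyverse)
set.seed(1)

# Assumption: centre the population at 100, the midpoint of 80 and 120.
pop_mean <- 100
# 96% inside [80, 120] leaves 2% in each tail, so 120 is the 98th percentile.
pop_sd <- (120 - pop_mean) / qnorm(0.98)

# One-column tibble, matching the later plots that use aes(x = Value).
pop <- tibble(Value = rnorm(1e6, mean = pop_mean, sd = pop_sd))

# Test: share of values inside [80, 120]; should be close to 0.96.
share_inside <- mean(pop$Value >= 80 & pop$Value <= 120)
print(share_inside)
```

The test is a simple proportion: the logical vector is TRUE for every value inside the interval, and its mean is the fraction of TRUE entries.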
- From this population, draw 10 samples – each with 10 values from the population (sampled without replacement).
- Create a tibble in long format with two columns: (1) Sample and (2) Value. Make Sample a factor (represents the sample id).
In [4]:
# Number of samples
S <- 10
# Sample size for each sample
N <- 10
# Construct tibble for all samples
samples <- ...
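The construction of the samples tibble could look roughly like the sketch below. It draws all S * N values in one call without replacement and labels consecutive blocks of N values with a sample id; this is one of several reasonable approaches, and the population parameters (mean 100, sd 9.74) are assumed stand-ins.

```r
library(tidyverse)
set.seed(1)

# Stand-in population (assumed parameters).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 9.74))

S <- 10  # number of samples
N <- 10  # sample size for each sample

# Draw S * N values without replacement and assign them to S samples
# of N values each; Sample is a factor holding the sample id.
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = sample(pop$Value, size = S * N, replace = FALSE)
)

print(head(samples))
```

Drawing everything in one call also guarantees that no population value is reused across samples, which is slightly stricter than sampling without replacement within each sample only.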
- Plot the population (black line) and all samples (color-coded) using a density plot.
In [5]:
ggplot() + geom_density(data=samples, aes(x=Value, color=Sample)) +
geom_density(data=pop, aes(x=Value), color="black") +
theme_bw() + theme(legend.position="none")
- Print the mean and standard deviation of the population and the means and standard deviations of each of the samples. Compare the values between the population and the samples (plot and printed values) – note your observations (see Q3).
In [6]:
# Mean and sd of population
mean(...)
sd(...)
# Mean and sd of each sample
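The elided calls above might be completed along the lines of this sketch, which uses summarise for the population and group_by plus summarise for the samples. The names pop_stats and sample_stats are hypothetical, and the population and samples are rebuilt here with assumed parameters so the block runs on its own.

```r
library(tidyverse)
set.seed(1)

# Stand-in population and samples (assumed parameters).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 9.74))
samples <- tibble(
  Sample = factor(rep(1:10, each = 10)),
  Value  = sample(pop$Value, size = 100, replace = FALSE)
)

# Mean and sd of population
pop_stats <- pop %>%
  summarise(Mean = mean(Value), SD = sd(Value))
print(pop_stats)

# Mean and sd of each sample (one row per sample)
sample_stats <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value), SD = sd(Value))
print(sample_stats)
```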
- Create a tibble with two columns, (1) Sample and (2) Mean, that contains the mean of each sample (one row per sample). What is the name of this new distribution of sample means (see Q2)?
In [7]:
samples.means <- ...
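A minimal sketch of how samples.means could be built is shown below. It assumes the long-format samples tibble described earlier (columns Sample and Value) and rebuilds it with assumed population parameters so that the block is self-contained.

```r
library(tidyverse)
set.seed(1)

# Stand-in population and samples (assumed parameters).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 9.74))
S <- 10
N <- 10
samples <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = sample(pop$Value, size = S * N, replace = FALSE)
)

# Collapse each sample to its mean: one row per sample.
samples.means <- samples %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

print(samples.means)
```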
- Print the mean and standard deviation of the population and the mean and standard deviation of the distribution of sample means. Note your observations (see Q3).
- Plot the population (black line) and all sample means (red line) using a density plot.
In [8]:
ggplot() + geom_density(data=samples.means, aes(x=Mean), color="red") +
geom_density(data=pop, aes(x=Value), color="black") +
theme_bw() + theme(legend.position="none")
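The printed comparison that accompanies this plot could be sketched as follows. The names pop_summary and means_summary are hypothetical, and the population parameters are assumed. The sample means should center near the population mean while spreading far less than the individual values do.

```r
library(tidyverse)
set.seed(1)

# Stand-in population and sample means (assumed parameters).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 9.74))
S <- 10
N <- 10
samples.means <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = sample(pop$Value, size = S * N, replace = FALSE)
) %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

# Mean and sd of the population
pop_summary <- c(mean = mean(pop$Value), sd = sd(pop$Value))
print(pop_summary)

# Mean and sd of the distribution of sample means
means_summary <- c(mean = mean(samples.means$Mean), sd = sd(samples.means$Mean))
print(means_summary)
```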
- Increase S to 100 and N from 10 to 30, and then to 100. For each value of N, repeat steps 9 and 10 and note your observations (see Q4). (Keep S=100 and N=100 going forward.)
- Print the mean and the standard error (sd/sqrt(N)) of the first sample and compare the outputs with those of step 9. Note your observations (see Q3).
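The last two steps can be explored together with a sketch like the one below. The helper draw_stats and the results tibble are hypothetical names, and the population parameters are assumed. For each N, the block records the spread of the sample means alongside the standard error estimated from the first sample alone.

```r
library(tidyverse)
set.seed(1)

# Stand-in population (assumed parameters).
pop <- tibble(Value = rnorm(1e6, mean = 100, sd = 9.74))
S <- 100  # number of samples

# Hypothetical helper: draw S samples of size N and summarise each one.
draw_stats <- function(N) {
  tibble(
    Sample = factor(rep(seq_len(S), each = N)),
    Value  = sample(pop$Value, size = S * N, replace = FALSE)
  ) %>%
    group_by(Sample) %>%
    summarise(Mean = mean(Value), SD = sd(Value), SE = SD / sqrt(n()))
}

# One row per sample size: spread of the sample means versus the
# standard error (sd/sqrt(N)) computed from the first sample only.
results <- bind_rows(lapply(c(10, 30, 100), function(N) {
  stats <- draw_stats(N)
  tibble(
    N             = N,
    MeanOfMeans   = mean(stats$Mean),
    SdOfMeans     = sd(stats$Mean),
    MeanFirst     = stats$Mean[1],
    SeFirstSample = stats$SE[1]
  )
}))

print(results)
```

If the Central Limit Theorem behaves as expected here, SdOfMeans and SeFirstSample should be of similar size for each N, and both should shrink as N grows.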
Questions
Q1 Briefly explain your test in step 3.
Q2 In your own words, describe the terms population, sample, and sampling distribution.
Q3 What did you observe when comparing the mean(s) and standard deviation(s) between the three distributions (population, first sample, sampling distribution)? Briefly explain the relationship between these three distributions (w.r.t. mean, standard deviation, and standard error).
Q4 What did you observe when increasing N, and what are the implications?
Q5 (Optional) Repeat the exercise by changing the distribution of the population.
Q6 (Optional) Create an animated visualization that shows how the sample size N affects the sample distributions and the sampling distribution.
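For Q5, one possible starting point is sketched below. It swaps in an exponential population with mean 100; the choice of distribution and its rate are assumptions, and any non-normal distribution would serve. The density of the sample means should look roughly bell-shaped even though the population is strongly right-skewed.

```r
library(tidyverse)
set.seed(1)

# Assumption: exponential population with mean 1 / rate = 100.
pop <- tibble(Value = rexp(1e6, rate = 1 / 100))

S <- 100  # number of samples
N <- 100  # sample size for each sample

# Sample means, built the same way as for the normal population.
samples.means <- tibble(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = sample(pop$Value, size = S * N, replace = FALSE)
) %>%
  group_by(Sample) %>%
  summarise(Mean = mean(Value))

# Population (black) versus sampling distribution (red).
p <- ggplot() +
  geom_density(data = samples.means, aes(x = Mean), color = "red") +
  geom_density(data = pop, aes(x = Value), color = "black") +
  theme_bw()
# Call print(p) to render the plot.
```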