In-class exercise Statistical Modeling: Instructions

High-level goal

The high-level goal of this exercise is twofold: (1) deepen your understanding of the CLT and (2) gain experience with simulations and working with common distributions in R.

Set up

  • Team up in groups of two and self-assign to a group (In-class-3-stats-modeling) on Canvas. (If you are in a Canvas group of size 1, you can still submit.) In the past, groups have found a pair-programming setup (in person, or using screen sharing for remote work) to be beneficial.

  • Set up R: Any one of the following options is sufficient for this exercise:

    1. Local: Install R on your machine.
    2. Local: Install RStudio on your machine.
    3. Web-based: Use a computational notebook such as Jupyter or Google Colab.
    4. Web-based: Use the WebR REPL (convenient, but still an early release).

    The (free) web-based options offer a very convenient way of exploring R. They are sufficient for most exercises, but you may need additional resources for some of the big-data analyses towards the end of the course.

CLT is more than an airport in NC – the Central Limit Theorem

Instructions

  1. Install (if needed) and load the following packages:
library(tidyverse)
  2. Set the random seed to 1.
set.seed(1)
  3. Create a population of 1M normally distributed values. Choose the mean and standard deviation such that about 96% of all values lie between 80 and 120. Test whether the created population meets this criterion (see Q1). It is sufficient to ballpark this with the output of a simple calculation, but the output should appear in your solution.
pop <- 
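One way to reason about step 3, offered as a sketch rather than the intended solution: a central share of 96% leaves 2% in each tail, so the upper bound of 120 should sit at roughly the 98th percentile. The names z, mu, sigma, and share are illustrative, and storing the population as a data frame with a Value column is an assumption made so that it matches the plotting code in later steps.

```r
set.seed(1)

# A central 96% leaves 2% in each tail, so the cutoff is the 98th percentile.
z <- qnorm(0.98)            # about 2.05 standard deviations
mu <- 100                   # midpoint of 80 and 120
sigma <- (120 - mu) / z     # about 9.74

# Assumption: the population is stored as a data frame with a Value column.
pop <- data.frame(Value = rnorm(1e6, mean = mu, sd = sigma))

# Empirical share of values between 80 and 120 (should be close to 0.96).
share <- mean(pop$Value >= 80 & pop$Value <= 120)
print(share)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is enough for the ballpark check that the step asks for.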
  4. From this population, draw 10 samples – each with 10 values from the population (sampled without replacement).

  5. Create a tibble in long format with two columns: (1) Sample and (2) Value. Make Sample a factor (it represents the sample ID).

# Number of samples
S <- 10
# Sample size for each sample
N <- 10
# Construct tibble for all samples
samples <-
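A minimal sketch of what a long-format table of samples can look like, using base R only. The stand-in population pop_values and the intermediate list draws are hypothetical names used for illustration, and "without replacement" is read here as applying within each sample, which is an assumption.

```r
set.seed(1)

# Stand-in population for illustration only (not the one required by step 3).
pop_values <- rnorm(1e5, mean = 100, sd = 10)

S <- 10  # number of samples
N <- 10  # sample size for each sample

# Draw S samples of N values each; sample() defaults to replace = FALSE.
draws <- lapply(seq_len(S), function(i) sample(pop_values, N))

# Long format: one row per drawn value, with the sample id as a factor.
samples <- data.frame(
  Sample = factor(rep(seq_len(S), each = N)),
  Value  = unlist(draws)
)

str(samples)
```

With the tidyverse loaded, `as_tibble(samples)` would convert this data frame into a tibble.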
  6. Plot the population (black line) and all samples (color-coded) using a density plot.
# Note: pop must be a data frame or tibble with a Value column for this call to work
ggplot() + geom_density(data=samples, aes(x=Value, color=Sample)) +
           geom_density(data=pop, aes(x=Value), color="black") +
           theme_bw() + theme(legend.position="none")
  7. Print the mean and standard deviation of the population and the mean and standard deviation of each sample. Compare the values between the population and the samples (plot and printed values) – note your observations (see Q3).
# Mean and sd of population
mean(...)
sd(...)
# Mean and sd of each sample
  8. Create a tibble with two columns: (1) Sample and (2) Mean that contains the mean of each sample (one row per sample). What is the name of this new distribution of sample means (see Q2)?
samples.means <- 
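As a hedged illustration of the "one row per sample" shape, the sketch below aggregates a tiny made-up long-format table with base R. The three-sample table is a stand-in, not the data from the exercise.

```r
set.seed(1)

# Small stand-in for the long-format samples table (illustration only).
samples <- data.frame(
  Sample = factor(rep(1:3, each = 4)),
  Value  = rnorm(12, mean = 100, sd = 10)
)

# One row per sample, holding that sample's mean.
samples.means <- aggregate(Value ~ Sample, data = samples, FUN = mean)
names(samples.means)[2] <- "Mean"

print(samples.means)
```

With the tidyverse loaded, grouping by Sample with `group_by()` and then calling `summarise(Mean = mean(Value))` should give the same result as a tibble.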
  9. Print the mean and standard deviation of the population and the mean and standard deviation of the distribution of sample means. Note your observations (see Q3).

  10. Plot the population (black line) and all sample means (red line) using a density plot.

ggplot() + geom_density(data=samples.means, aes(x=Mean), color="red") +
           geom_density(data=pop, aes(x=Value), color="black") +
           theme_bw() + theme(legend.position="none")
  11. Increase S to 100, and increase N from 10 to 30 and then to 100. For each value of N, repeat steps 9 and 10 and note your observations (see Q4). (Keep S=100 and N=100 going forward.)

  12. Print the mean and the standard error (sd/sqrt(N)) of the first sample and compare the outputs with those of step 9. Note your observations (see Q3).
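The comparison in the last step can be sketched as follows, under stated assumptions: a stand-in normal population (pop_values, mean 100, sd 10) replaces the one from step 3, and the names draws, se_first, sd_of_means, and se_theory are illustrative. The sketch puts three quantities side by side: the standard error estimated from a single sample, the standard deviation of the sample means, and the population standard deviation divided by sqrt(N).

```r
set.seed(1)

# Stand-in population for illustration only.
pop_values <- rnorm(1e5, mean = 100, sd = 10)

S <- 100  # number of samples
N <- 100  # sample size for each sample

# Each column of the resulting N x S matrix is one sample of size N.
draws <- replicate(S, sample(pop_values, N))
sample_means <- colMeans(draws)

# Standard error estimated from the first sample alone.
first <- draws[, 1]
se_first <- sd(first) / sqrt(N)

# Spread of the sample means across all S samples.
sd_of_means <- sd(sample_means)

# Value implied by the population standard deviation.
se_theory <- sd(pop_values) / sqrt(N)

print(c(se_first = se_first, sd_of_means = sd_of_means, se_theory = se_theory))
```

All three numbers should land near 1 here, because the stand-in population has a standard deviation of about 10 and sqrt(N) is 10.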

Questions

  • Q1 Briefly explain your test in step 3.

  • Q2 In your own words, describe the terms population, sample, and sampling distribution.

  • Q3 What did you observe when comparing the mean(s) and standard deviation(s) between the three distributions (population, first sample, sampling distribution)? Briefly explain the relationship between these three distributions (w.r.t. mean, standard deviation, and standard error).

  • Q4 What did you observe when increasing N, and what are the implications?

  • Q5 (Optional) Repeat the exercise by changing the distribution of the population.

  • Q6 (Optional) Create an animated visualization that shows how the sample size N affects the sample distributions and the sampling distribution.
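For Q5, one possible starting point is a clearly skewed population. The sketch below assumes an exponential population, which is just one choice among many. It also uses more samples (S = 1000) than the main exercise so that the skewness estimate is more stable. The helper skewness is hypothetical and defined inline; it is not a base R function.

```r
set.seed(1)

# Assumption: an exponential population serves as the non-normal example.
pop_values <- rexp(1e5, rate = 1 / 100)  # mean 100, strongly right-skewed

S <- 1000  # more samples than in the exercise, to stabilise the estimate
N <- 100   # sample size for each sample

# Mean of each of the S samples (each drawn without replacement).
sample_means <- replicate(S, mean(sample(pop_values, N)))

# Hypothetical helper: sample skewness (third standardised moment).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(c(pop = skewness(pop_values), means = skewness(sample_means)))
```

The population skewness should be close to 2, while the skewness of the sample means should be much closer to 0, which is the pattern the CLT predicts.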

Deliverables

  1. An executable notebook with all your code and answers to the questions above. Acceptable notebook formats are:
    • R Markdown: .Rmd
    • Jupyter: .ipynb
    Please list all group members at the top of your notebook (or just your name if you worked alone).

You may use the starter code linked from the table of contents. Note: the .ipynb notebook was automatically converted from the .Rmd notebook (using Quarto).

Steps for turn-in

One team member should upload the deliverables to Canvas.