{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# In-class exercise Statistical Modeling: Solutions\n", "Author: Name1, Name2\n", "---\n", "\n", "# CLT is more than an airport in NC -- the Central Limit Theorem\n", "\n", "### Instructions\n", "\n", "1. Install (if needed) and load the following packages:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "library(tidyverse)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Set the random seed to 1." ] }, { "cell_type": "code", "metadata": {}, "source": [ "set.seed(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create a population: 1M normally distributed values. Set mean and standard\n", " deviation such that about 96% of all values are between 80 and 120.\n", " *Test whether the created population meets the stated criteria* (see Q1). It is\n", " sufficient to ballpark the output of a simple calculation, but the output\n", " should appear in your solution." ] }, { "cell_type": "code", "metadata": {}, "source": [ "pop <- " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. From this population, draw 10 samples -- each with 10 values from the\n", " population (sampled without replacement).\n", "\n", "5. Create a tibble in long format with two columns: (1) `Sample` and (2) `Value`.\n", " Make `Sample` a factor (represents the sample id)." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Number of samples\n", "S <- 10\n", "# Sample size for each sample\n", "N <- 10\n", "# Construct tibble for all samples\n", "samples <-" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Plot the population (black line) and all samples (color-coded) using a density\n", " plot." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "ggplot() + geom_density(data=samples, aes(x=Value, color=Sample)) +\n", " geom_density(data=pop, aes(x=Value), color=\"black\") +\n", " theme_bw() + theme(legend.position=\"none\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Print the mean and standard deviation of the population and the means and\n", " standard deviations of each of the samples.\n", " Compare the values between the population and the samples (plot and printed\n", " values) -- note your observations (see Q3)." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Mean and sd of population\n", "mean(...)\n", "sd(...)\n", "# Mean and sd of each sample" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Create a tibble with two columns: (1) `Sample` and (2) `Mean` that contains\n", " the mean of each sample (one row per sample). What is the name of this new\n", " distribution of sample means (see Q2)?" ] }, { "cell_type": "code", "metadata": {}, "source": [ "samples.means <- " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. Print the mean and standard deviation of the population and the mean and\n", " standard deviation of the distribution of sample means. Note your\n", " observations (see Q3).\n", "\n", "10. Plot the population (black line) and all sample means (red line) using a\n", " density plot." ] }, { "cell_type": "code", "metadata": {}, "source": [ "ggplot() + geom_density(data=samples.means, aes(x=Mean), color=\"red\") +\n", " geom_density(data=pop, aes(x=Value), color=\"black\") +\n", " theme_bw() + theme(legend.position=\"none\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "11. Increase `S` to 100 and `N` from 10 to 30, and then to 100. 
For each value\n", " of `N`, repeat steps 9 and 10 and note your observations (see Q4).\n", " (Keep `S=100` and `N=100` going forward.)\n", "\n", "12. Print the mean and the standard error (`sd/sqrt(N)`) of the first sample and\n", " compare the outputs with those of step 9. Note your observations (see Q3).\n", "\n", "\n", "### Questions\n", "* Q1 Briefly explain your test in step 3.\n", "\n", "* Q2 In your own words, describe the terms population, sample, and sampling distribution.\n", "\n", "* Q3 What did you observe when comparing the mean(s) and standard deviation(s)\n", " between the three distributions (population, first sample, sampling\n", " distribution)? Briefly explain the relationship between these three\n", " distributions (w.r.t. mean, standard deviation, and standard error).\n", "\n", "* Q4 What did you observe when increasing `N`, and what are the implications?\n", "\n", "* Q5 (Optional) Repeat the exercise by changing the distribution of the\n", " population.\n", "\n", "* Q6 (Optional) Create an animated visualization that shows how the sample size\n", " `N` affects the sample distributions and the sampling distribution." ] } ], "metadata": { "kernelspec": { "name": "ir", "language": "R", "display_name": "R", "path": "/Users/rjust/Library/Jupyter/kernels/ir" } }, "nbformat": 4, "nbformat_minor": 4 }