{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "# In-class exercise Statistical Modeling: Solutions\n",
        "Author: Name1, Name2\n",
        "---\n",
        "\n",
        "# CLT is more than an airport in NC -- the Central Limit Theorem\n",
        "\n",
        "### Instructions\n",
        "\n",
        "1. Install (if needed) and load the following packages:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "library(tidyverse)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "2. Set the random seed to 1."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "set.seed(1)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "3. Create a population: 1M normally distributed values. Set mean and standard\n",
        "  deviation such that  about 96% of all values are between 80 and 120.\n",
        "  *Test whether the created population meets the stated criteria* (see Q1). It is\n",
        "  sufficient to ballpark the output of a simple calculation, but the output\n",
        "  should appear in your solution."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "pop <- "
      ],
      "execution_count": null,
      "outputs": []
    },
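    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of one way to approach the check in step 3, not the required\n",
        "solution. It assumes the population is centered at 100 (the midpoint of 80 and\n",
        "120). The names `mu`, `sigma`, and `x` are illustrative, and the simulated\n",
        "population is smaller than 1M to keep the sketch quick.\n",
        "\n",
        "```r\n",
        "# Assumption: [80, 120] is symmetric around the mean, so mu = 100\n",
        "mu <- 100\n",
        "# 96% in the middle leaves 2% in each tail, so 120 is the 98th percentile\n",
        "sigma <- (120 - mu) / qnorm(0.98)\n",
        "# Theoretical share of values between 80 and 120\n",
        "pnorm(120, mu, sigma) - pnorm(80, mu, sigma)\n",
        "# Empirical share in a simulated population\n",
        "set.seed(1)\n",
        "x <- rnorm(1e5, mean = mu, sd = sigma)\n",
        "mean(x >= 80 & x <= 120)\n",
        "```\n",
        "\n",
        "The theoretical share follows directly from the choice of `sigma`; the empirical\n",
        "share should land close to it."
      ]
    },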
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "4. From this population, draw 10 samples -- each with 10 values from the\n",
        "  population (sampled without replacement).\n",
        "\n",
        "5. Create a tibble in long format with two columns: (1) `Sample` and (2) `Value`.\n",
        "  Make `Sample` a factor (represents the sample id)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Number of samples\n",
        "S <- 10\n",
        "# Sample size for each sample\n",
        "N <- 10\n",
        "# Construct tibble for all samples\n",
        "samples <-"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "6. Plot the population (black line) and all samples (color-coded) using a density\n",
        "  plot."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "ggplot() + geom_density(data=samples, aes(x=Value, color=Sample)) +\n",
        "           geom_density(data=pop, aes(x=Value), color=\"black\") +\n",
        "           theme_bw() + theme(legend.position=\"none\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "7. Print the mean and standard deviation of the population and the means and\n",
        "   standard deviations of each of the samples.\n",
        "   Compare the values between the population and the samples (plot and printed\n",
        "   values) -- note your observations (see Q3)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Mean and sd of population\n",
        "mean(...)\n",
        "sd(...)\n",
        "# Mean and sd of each sample"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "8. Create a tibble with two columns: (1) `Sample` and (2) `Mean` that contains\n",
        "   the mean of each sample (one row per sample). What is the name of this new\n",
        "   distribution of sample means (see Q2)?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "samples.means <- "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "9. Print the mean and standard deviation of the population and the mean and\n",
        "   standard deviation of the distribution of sample means. Note your\n",
        "   observations (see Q3).\n",
        "\n",
        "10. Plot the population (black line) and all sample means (red line) using a\n",
        "    density plot."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "ggplot() + geom_density(data=samples.means, aes(x=Mean), color=\"red\") +\n",
        "           geom_density(data=pop, aes(x=Value), color=\"black\") +\n",
        "           theme_bw() + theme(legend.position=\"none\")"
      ],
      "execution_count": null,
      "outputs": []
    },
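    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The comparison in step 9 can be sketched in base R as follows. This is a\n",
        "self-contained illustration, not the notebook's own solution: the stand-in\n",
        "population `x`, its standard deviation of 10, and the name `means` are\n",
        "assumptions made for the example.\n",
        "\n",
        "```r\n",
        "set.seed(1)\n",
        "# Assumed stand-in population (sd = 10 is an illustrative value)\n",
        "x <- rnorm(1e5, mean = 100, sd = 10)\n",
        "S <- 10  # number of samples\n",
        "N <- 10  # values per sample\n",
        "# replicate() gives an N x S matrix, so colMeans() returns one mean per sample\n",
        "means <- colMeans(replicate(S, sample(x, N)))\n",
        "# Population vs. distribution of sample means\n",
        "c(mean = mean(x), sd = sd(x))\n",
        "c(mean = mean(means), sd = sd(means))\n",
        "# Value the CLT predicts for the sd of the sample means\n",
        "sd(x) / sqrt(N)\n",
        "```\n",
        "\n",
        "With only 10 samples the standard deviation of the sample means is a noisy\n",
        "estimate, so expect it to be near, not equal to, `sd(x) / sqrt(N)`."
      ]
    },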
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "11. Increase `S` to 100 and `N` from 10 to 30, and then to 100. For each value\n",
        "    of `N`, repeat steps 9 and 10 and note your observations (see Q4).\n",
        "    (Keep `S=100` and `N=100` going forward.)\n",
        "\n",
        "12. Print the mean and the standard error (sd/sqrt(N)) of the first sample and\n",
        "    compare the outputs with those of step 9. Note your observations (see Q3).\n",
        "\n",
        "\n",
        "### Questions\n",
        "* Q1 Briefly explain your test in step 3.\n",
        "\n",
        "* Q2 In your own words, describe the terms population, sample, and sampling distribution.\n",
        "\n",
        "* Q3 What did you observe when comparing the mean(s) and standard deviation(s)\n",
        "   between the three distributions (population, first sample, sampling\n",
        "   distribution)? Briefly explain the relationship between these three\n",
        "   distributions (w.r.t. mean, standard deviation, and standard error).\n",
        "\n",
        "* Q4 What did you observe when increasing N, what are the implications?\n",
        "\n",
        "* Q5 (Optional) Repeat the exercise by changing the distribution of the\n",
        "     population.\n",
        "\n",
        "* Q6 (Optional) Create an animated visualization that shows how the sample size\n",
        "     `N` affects the sample distributions and the sampling distribution."
      ]
    }
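    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Steps 11 and 12 can be explored with a sketch like the one below. It is a\n",
        "base-R illustration under stated assumptions, not a definitive solution: the\n",
        "stand-in population `x` (sd = 10) and the names `sizes`, `sd_of_means`, and\n",
        "`first` are hypothetical.\n",
        "\n",
        "```r\n",
        "set.seed(1)\n",
        "# Assumed stand-in population (sd = 10 is an illustrative value)\n",
        "x <- rnorm(1e5, mean = 100, sd = 10)\n",
        "S <- 100\n",
        "sizes <- c(10, 30, 100)\n",
        "# sd of the S sample means for each sample size N\n",
        "sd_of_means <- sapply(sizes, function(N) sd(colMeans(replicate(S, sample(x, N)))))\n",
        "data.frame(N = sizes, sd_of_means = sd_of_means, predicted = sd(x) / sqrt(sizes))\n",
        "# Step 12: mean and standard error estimated from a single sample of 100 values\n",
        "first <- sample(x, 100)\n",
        "c(mean = mean(first), se = sd(first) / sqrt(length(first)))\n",
        "```\n",
        "\n",
        "If the Central Limit Theorem applies, `sd_of_means` should track `predicted`,\n",
        "and the standard error from the single sample should be close to the last\n",
        "`predicted` value."
      ]
    }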
  ],
  "metadata": {
    "kernelspec": {
      "name": "ir",
      "language": "R",
      "display_name": "R",
      "path": "/Users/rjust/Library/Jupyter/kernels/ir"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}