suppressPackageStartupMessages({
library(tidyverse)
library(gganimate)
library(gifski)
})
Statistical modeling
Tutorial
Packages used
Random seed
Fix the random seed for reproducibility.
set.seed(1)
The actual seed value doesn’t really matter here. We fix it so that we can refer to particular samples.
Uniform distribution
1M data points, uniformly distributed between 0 and 10.
<- tibble(Value=runif(1000000, 0, 10)) uni
This creates a tibble with a single column Value
.
Mean and standard deviation
mean(uni$Value)
[1] 4.999223
sd(uni$Value)
[1] 2.886293
Recall that uni$Value
indexes the column Value
in tibble (data frame) uni
.
Density plot
Plot the density of all values in column Value
in the tibble (data frame) uni
.
ggplot(uni, aes(x=Value)) + geom_density() + theme_bw()
Drawing samples
Create a total of 10 samples and sample 10 values from the population for each.
<- 10
S <- 100
N
<- replicate(S, sample(uni$Value, N, replace=F), simplify=F)
l
# Name each entry in the list
names(l) <- seq(1, S)
# Create a tibble from the list
<- as_tibble(l)
samples.wide
# What does this tibble look like?
samples.wide
# A tibble: 100 × 10
`1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6.97 7.79 4.33 5.11 3.16 6.77 1.10 5.07 7.49 7.59
2 4.91 2.53 7.34 5.95 7.37 5.78 2.56 5.12 8.60 5.69
3 6.99 1.32 5.11 3.65 5.25 5.15 2.90 4.60 5.91 0.723
4 9.68 0.565 8.65 7.96 5.87 2.38 8.94 0.468 7.78 9.67
5 6.54 6.60 5.08 4.11 1.17 3.09 9.25 0.426 4.08 7.65
6 8.50 3.94 7.59 9.67 3.77 0.135 0.558 3.26 4.92 4.93
7 5.35 5.86 7.13 1.92 7.62 0.835 9.97 4.64 8.36 1.61
8 6.36 9.30 0.517 7.86 3.92 1.67 4.05 9.36 9.74 3.30
9 8.80 5.68 0.863 3.13 8.90 2.80 5.63 5.08 9.75 7.03
10 8.80 5.80 6.92 4.78 2.48 7.93 4.50 1.10 2.60 6.72
# ℹ 90 more rows
# samples.long <- ...
Creates a tibble with S columns – each a sample from the population.
The call to
sample
samplesN
values fromuni$Value
without replacement.replicate
repeats the call tosample
S
times and returns a list of samples. We simply enumerate the samples from1
toS
and use the sample id as column name.
There many ways of creating a tibble in long format.
You can use your R magic to convert from wide to long, or create a tibble in long format directly (e.g., bind_rows
).
Plot the population and all samples.
The population from which we sample is plotted in black. The samples from this population are color-coded.
ggplot() + geom_density(data=samples.long, aes(x=Value, color=Sample, color=Sample)) +
geom_density(data=uni, aes(x=Value), color="black") +
theme_minimal()
Warning: Duplicated aesthetics after name standardisation: colour
Notes
Each geom_density uses a different tibble (data frame).
Setting
color
orfill
within the aestheticsaes
leads to- grouping of values by labels and
- each density plot having a different line color. This example uses a default color palette, which can be changed. (Note: If
sample
was a numeric column, a color gradient would be used in stead of distinct colors.)
Setting
color
outside the aestheticsaes
leads to a single line color for the entire plot.When breaking lines, make sure to put the
+
at the end of the line, not on the new line.
Simulations
Animated plot of samples
Let’s plot the population and each sample again, but with an animation.
<- ggplot() + geom_density(data=samples.long, aes(x=Value, color=Sample, fill=Sample)) +
p geom_density(data=uni, aes(x=Value), color="black") +
theme_minimal() +
labs(title = "Sample: {current_frame}") +
transition_manual(Sample)
animate(p, fps=20)
`nframes` and `fps` adjusted to match transition
Galton board
Another example for a simulation: The Galton board.
. . .
<- function() {
makeplot ::quincunx(balls = 200, col.balls = rainbow(200))
animation
}::save_gif(makeplot(), "/tmp/galton.gif", 700, 500, delay=0.1) gifski
. . .
Computational notebooks
Render Rmarkdown files
::render('in_class3.Rmd') rmarkdown
Rscript -e "rmarkdown::render('in_class3.Rmd')"