CSEP 590
Building Data Analysis Pipelines

Fall 2024

Advanced Statistical Modeling

Linear models

Packages used

library(tidyverse)
# From tidymodels to work with model objects
library(broom)

Simulate a population

Population of 1000 pairs

  • x is uniformly distributed between 1 and 5.
  • y has a known linear relationship to x: y = a * x + b.
  • a is 1.5.
  • b is a random noise term, normally distributed with a mean of 0.
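
A minimal sketch of how such a population could be simulated. The seed, the standard deviation of b (set to 1 here), and the names n_pop and population are assumptions for illustration; the notes specify only the range of x, the value of a, and the mean of b.

```r
library(tidyverse)

set.seed(590)  # arbitrary seed, used only to make the sketch reproducible

n_pop <- 1000  # population size from the notes
a <- 1.5       # slope from the notes

population <- tibble(
  x = runif(n_pop, min = 1, max = 5),   # uniform between 1 and 5
  b = rnorm(n_pop, mean = 0, sd = 1),   # sd = 1 is an assumption
  y = a * x + b                         # known linear relationship
)
```

Because b varies from pair to pair, it acts as noise around the line y = 1.5 * x rather than as a fixed intercept.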

Visualize the entire population

Regression model for the population
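
One way to fit a regression model to the whole population is lm, inspected with tidy and glance from broom. This is a sketch under assumptions: the noise standard deviation (1), the seed, and the name pop_model are not from the notes.

```r
library(tidyverse)
library(broom)

set.seed(590)  # arbitrary seed, for reproducibility only
population <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = 1.5 * x + rnorm(1000, mean = 0, sd = 1)  # sd = 1 is an assumption
)

# Ordinary least squares fit of y on x over the entire population
pop_model <- lm(y ~ x, data = population)

tidy(pop_model)    # one row per coefficient: (Intercept) and x
glance(pop_model)  # one-row model summary, including r.squared
```

With 1000 pairs, the estimated slope should land close to the true value of 1.5 and the intercept close to 0.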

Draw 100 samples of size 50
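
A sketch of one way to draw the samples, using slice_sample from dplyr inside map. The column name sample_id, the seed, and the noise standard deviation are assumptions; the notes do not show how the samples were drawn.

```r
library(tidyverse)

set.seed(590)  # arbitrary seed, for reproducibility only
population <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = 1.5 * x + rnorm(1000, mean = 0, sd = 1)  # sd = 1 is an assumption
)

# Draw 100 samples of 50 pairs each (without replacement within a sample)
# and tag every row with the sample it belongs to.
samples <- map(1:100, function(i) {
  population %>%
    slice_sample(n = 50) %>%
    mutate(sample_id = i)
}) %>%
  bind_rows()
```

The result is a single long tibble with 5000 rows, which is convenient for faceted plots and for grouping by sample_id.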

Plot the samples

Regression model for each sample

List columns can store values of arbitrary types

This code puts the fitted model in a column named model. This is very useful when working with more complex data structures and avoids unnecessary (un)nesting.
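
A minimal sketch of the list-column pattern described above, assuming a long tibble of samples identified by a hypothetical sample_id column. nest collects each sample's rows into a list column named data, and map stores one fitted lm object per row in model.

```r
library(tidyverse)

set.seed(590)  # arbitrary seed, for reproducibility only
population <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = 1.5 * x + rnorm(1000, mean = 0, sd = 1)  # sd = 1 is an assumption
)

samples <- map(1:100, function(i) {
  population %>% slice_sample(n = 50) %>% mutate(sample_id = i)
}) %>%
  bind_rows()

models <- samples %>%
  nest(data = c(x, y)) %>%                                  # one row per sample
  mutate(model = map(data, function(d) lm(y ~ x, data = d)))  # list column of lm fits
```

Each element of models$model is a complete lm object, so later steps can extract whatever they need from it without refitting.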

Model fit, slope, and intercept

Let’s first define three helper functions for simplicity:
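
A sketch of what the three helpers might look like, using the names get_R2, get_slope, and get_icept from these notes. The function bodies are assumptions, as is the predictor name x; the toy data exist only to exercise the helpers.

```r
library(dplyr)
library(broom)

# Base R: read R-squared from the model summary
get_R2 <- function(m) {
  summary(m)$r.squared
}

# broom: tidy() returns one row per coefficient; keep the row for x
get_slope <- function(m) {
  tidy(m) %>% filter(term == "x") %>% pull(estimate)
}

# broom: same idea for the intercept row
get_icept <- function(m) {
  tidy(m) %>% filter(term == "(Intercept)") %>% pull(estimate)
}

# Toy data (hypothetical), only to show the helpers in use
toy <- tibble(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_model <- lm(y ~ x, data = toy)

get_R2(toy_model)
get_slope(toy_model)
get_icept(toy_model)
```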

Tidy vs. base R

  • get_R2 is a base R solution; get_slope and get_icept use tidy from the broom package.
  • tidy(x) converts x into a tibble, which makes it easier to code consistently.
  • pull is similar to select, except it returns a vector (as opposed to a tibble).

Model fit, slope, and intercept

The map functions

  • map applies a given function to each element of a vector or list and returns a list.
  • map_dbl does the same but returns a double vector, so each call must yield a single number.
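
A sketch contrasting map and map_dbl on a list column of fitted models. The simulation settings, the sample_id column, and the helper bodies are assumptions made so that the block runs on its own.

```r
library(tidyverse)
library(broom)

set.seed(590)  # arbitrary seed, for reproducibility only
population <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = 1.5 * x + rnorm(1000, mean = 0, sd = 1)  # sd = 1 is an assumption
)

# 100 samples of size 50, one fitted model per sample
models <- map(1:100, function(i) {
  population %>% slice_sample(n = 50) %>% mutate(sample_id = i)
}) %>%
  bind_rows() %>%
  nest(data = c(x, y)) %>%
  mutate(model = map(data, function(d) lm(y ~ x, data = d)))

# Helpers (assumed implementations)
get_R2 <- function(m) summary(m)$r.squared
get_slope <- function(m) tidy(m) %>% filter(term == "x") %>% pull(estimate)
get_icept <- function(m) tidy(m) %>% filter(term == "(Intercept)") %>% pull(estimate)

# map returns a list with one element per model
slopes_list <- map(models$model, get_slope)

# map_dbl returns a plain double vector, which fits an ordinary column
results <- models %>%
  mutate(
    R2 = map_dbl(model, get_R2),
    slope = map_dbl(model, get_slope),
    intercept = map_dbl(model, get_icept)
  )
```

Because R2, slope, and intercept are ordinary numeric columns, they can be passed directly to ggplot, for example as a histogram of the 100 slopes.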

Plot all slopes

Plot all intercepts instead

Assumption checking

Residual plot
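
A sketch of a residual plot built with augment from broom and ggplot2. The model name m matches the one used in these notes, but the data it is fitted to here (a simulated population with noise standard deviation 1) is an assumption.

```r
library(tidyverse)
library(broom)

set.seed(590)  # arbitrary seed, for reproducibility only
population <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = 1.5 * x + rnorm(1000, mean = 0, sd = 1)  # sd = 1 is an assumption
)

m <- lm(y ~ x, data = population)

# augment() adds per-observation columns such as .fitted and .resid
diagnostics <- augment(m)

residual_plot <- ggplot(diagnostics, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

residual_plot
```

If the linear model is appropriate, the points should scatter evenly around the dashed zero line with no visible curve or funnel shape.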

Run plot(m) for all diagnostic plots

Assumption violations

Population with a non-linear relationship
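
The notes do not state which non-linear relationship was used. As a purely illustrative assumption, the sketch below simulates a population whose mean grows exponentially in x, with additive normal noise; the rate 0.5, the noise standard deviation 0.3, and the name population_nl are all hypothetical.

```r
library(tidyverse)

set.seed(590)  # arbitrary seed, for reproducibility only

# Hypothetical non-linear population: E[y | x] = exp(0.5 * x)
population_nl <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = exp(0.5 * x) + rnorm(1000, mean = 0, sd = 0.3)  # rate and sd are assumptions
)
```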

Visualize the entire population

First modeling attempt

Second modeling attempt (log-transformed y)

Final modeling attempt (GLM)
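
A sketch of the three modeling attempts side by side, fitted to a hypothetical population with an exponential mean and additive noise. The data-generating process, the choice of a Gaussian family with a log link for the GLM, and all object names are assumptions; the notes name the attempts but not their details.

```r
library(tidyverse)
library(broom)

set.seed(590)  # arbitrary seed, for reproducibility only

# Hypothetical non-linear population: E[y | x] = exp(0.5 * x)
population_nl <- tibble(
  x = runif(1000, min = 1, max = 5),
  y = exp(0.5 * x) + rnorm(1000, mean = 0, sd = 0.3)  # rate and sd are assumptions
)

# First attempt: straight line on the original scale
fit_linear <- lm(y ~ x, data = population_nl)

# Second attempt: straight line after log-transforming y
fit_log <- lm(log(y) ~ x, data = population_nl)

# Final attempt: GLM that models log(E[y]) as linear in x
# (Gaussian family with log link is an assumed choice)
fit_glm <- glm(y ~ x, family = gaussian(link = "log"), data = population_nl)

tidy(fit_linear)
tidy(fit_log)
tidy(fit_glm)
```

The log-transformed lm models the mean of log(y), while the GLM models the log of the mean of y, so the two are not the same model even though both report coefficients on the log scale. Residual plots for each fit are a reasonable way to compare them.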