R Basics

Required packages

library(tidyverse)
library(nycflights23)

Note

All packages are automatically loaded in this tutorial.

The tidyverse package is a metapackage – a collection of many related packages.

Quoting from the tidyverse website:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Explore a given data frame

Note

The flights data frame is provided by the nycflights23 package. Usually, you will import data from an external source; a later tutorial will cover various data import options.

Column names
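
As a minimal sketch of how the column names can be listed, the base R functions names() and colnames() both return them as a character vector. The variable name flight_cols is only illustrative.

```r
library(tidyverse)
library(nycflights23)

# names() returns the column names of a data frame as a character vector
flight_cols <- names(flights)
print(flight_cols)

# colnames() gives the same result for a data frame
print(identical(flight_cols, colnames(flights)))
```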

Structure and example data
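
One possible way to inspect the structure together with a few example values is sketched below. It assumes the base R function str() and its tidyverse counterpart glimpse(); the tutorial itself may use only one of them.

```r
library(tidyverse)
library(nycflights23)

# str() is base R: one line per column with its type and the first values
str(flights)

# glimpse() is the tidyverse counterpart with a similar, more compact layout
glimpse(flights)
```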

First/last n rows
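
The first and last rows can be shown with head() and tail(). The sketch below picks n = 5 arbitrarily for illustration; the variable names first_rows and last_rows are hypothetical.

```r
library(tidyverse)
library(nycflights23)

# First 5 rows of the data frame
first_rows <- head(flights, n = 5)
print(first_rows)

# Last 5 rows of the data frame
last_rows <- tail(flights, n = 5)
print(last_rows)
```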

Assignments

Assignment operators

All three statements are equivalent. R allows the assignment target to appear at the beginning (<- or =) or at the end (->) of a statement. Also note the two different leftward assignment operators (<- vs. =): you may use either, but using <- for variable assignment and = for named arguments may improve clarity and readability.
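
The original three statements are not preserved in this text. The following is an assumed reconstruction with illustrative variable names (x, y, z, first_three) that shows the same idea.

```r
# Three ways to assign the value 42 to a variable
x <- 42   # leftward assignment: target at the beginning
42 -> y   # rightward assignment: target at the end
z = 42    # `=` also assigns when used at the top level

# All three variables hold the same value
print(identical(x, y) && identical(y, z))

# Suggested convention: `<-` for the variable, `=` for the named argument
first_three <- head(letters, n = 3)
print(first_three)
```

Inside a function call, = binds a value to a named argument rather than creating a variable, which is one reason for keeping the two operators visually distinct.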

Summary statistics
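
A minimal sketch, assuming the base R function summary() is what this section refers to (the questions below mention summary(flights)). The variable name flights_summary is illustrative.

```r
library(tidyverse)
library(nycflights23)

# summary() computes per-column descriptive statistics;
# what is reported depends on each column's data type
flights_summary <- summary(flights)
print(flights_summary)
```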

A first (simple) visualization
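
The tutorial's original plot is not preserved in this text. The sketch below assumes a ggplot2 box plot of flight distance per origin airport, which fits the questions at the end of this section but may differ from the original.

```r
library(tidyverse)
library(nycflights23)

# Assumed plot: distribution of distance for each origin airport
p <- ggplot(flights, aes(x = origin, y = distance)) +
  geom_boxplot()

print(p)
```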

Documentation
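
As a short sketch of how the built-in documentation can be reached: ? and help() open the help page for a topic, and args() prints a function's argument list. The variable name head_args is illustrative.

```r
library(nycflights23)

# Open the help page for a function (two equivalent forms)
?head
help("head")

# Data sets shipped with a package are documented as well
?flights

# Show the arguments of the generic function without opening its help page
head_args <- args(head)
print(head_args)
```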

Questions

  • Q1 Look at the documentation for the head function (?head). How many rows does head print by default? What programming language feature is used to achieve this default behavior?

  • Q2 Provide two syntactically different calls of the head function that both result in the same output of the first 3 rows of the flights data frame. Which syntax is preferable (to you) and why?

  • Q3 Given the output of summary(flights), what do you observe in terms of data types and descriptive statistics? Compare the output for carrier and time_hour: which is more useful? What changes would improve usefulness?

  • Q4 Change the plot to show the distribution of distance per month (i.e., x=month). Look at the warning message: what is the root cause of this problem? What are possible solutions?

  • Q5 Change the plot to show the distribution of distance per month, grouped by origin airport (i.e., x=month, fill=origin). Look at the plot and/or any warning messages: what is the correct solution (conceptually) to avoid the issues observed in Q4 and Q5?