This appendix provides a structured overview of core R language concepts for readers who want to move beyond app-based usage and perform analysis directly in code. The focus is on concepts that are used repeatedly in statistical workflows.
R is an interpreted language with vectorized semantics and a functional core. Most statistical procedures in R are exposed as functions that operate on vectors, matrices, data frames, or model objects. In practice, this means:
R can be used in procedural, functional, and object-oriented styles (e.g., S3 classes in many statistical packages).
An atomic vector stores values of a single underlying type (logical, integer, double, character, etc.).
Code
x_num <- c(2.1, 3.4, 5.0)
x_chr <- c("A", "B", "C")
typeof(x_num)
length(x_num)
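As a minimal sketch of what vectorized semantics means for atomic vectors, arithmetic and comparisons are applied element by element without an explicit loop. The variable names (heights_cm, heights_m, is_tall) are hypothetical and chosen only for illustration.

```r
# Hypothetical measurements; names are illustrative only
heights_cm <- c(172, 165, 180)

# Arithmetic is applied to every element at once
heights_m <- heights_cm / 100

# Comparisons are also element-wise and return a logical vector
is_tall <- heights_cm > 170

# Summary functions reduce a vector to a single value
mean_height <- mean(heights_cm)
```

The same pattern carries over to most statistical functions, which accept whole vectors as arguments.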
Factors represent categorical variables with explicit levels. Many model-fitting functions automatically encode factors as contrast matrices (treatment/dummy coding by default; see contrasts() for alternatives).
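The following sketch illustrates how a factor with explicit levels is turned into a design matrix. The variable group and its level names are hypothetical; model.matrix() is used here only to make the default treatment coding visible.

```r
# Hypothetical treatment variable with an explicit level order
group <- factor(c("control", "drug", "drug", "control"),
                levels = c("control", "drug"))

levels(group)      # the first level acts as the reference category
contrasts(group)   # default treatment (dummy) coding

# model.matrix() shows the columns a model-fitting function would build
design <- model.matrix(~ group)
colnames(design)   # "(Intercept)" "groupdrug"
```

Setting the level order explicitly is one way to control which category serves as the reference in a fitted model.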
Lists can store heterogeneous elements and are widely used for function returns.
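A small sketch of a function that returns several results in one list may help here. The helper describe() is hypothetical and not part of any package.

```r
# Hypothetical helper that bundles several results into one list
describe <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}

res <- describe(c(2.1, 3.4, 5.0))

res$mean        # extract an element by name with $
res[["range"]]  # or with [[ ]]
str(res)        # compact overview of the list structure
```

Many model-fitting functions return lists of this kind, with a class attribute added on top.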
Matrices are 2-dimensional homogeneous structures. Arrays generalize this to higher dimensions.
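As an illustration, the sketch below builds a matrix and a three-dimensional array from integer sequences. The object names m and a are arbitrary.

```r
# A 2 x 3 matrix; values are filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)          # 2 3
m[2, 3]         # row 2, column 3: 6

# An array with three dimensions (2 x 3 x 4)
a <- array(1:24, dim = c(2, 3, 4))
dim(a)          # 2 3 4
a[1, 2, 3]      # 15

# All elements share one type
typeof(m)       # "integer"
```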
A data frame is a tabular structure with rows as observations and columns as variables. Columns can have different types. Tibbles are modern data frames with stricter printing and subsetting behavior.
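A minimal base R sketch of a data frame with columns of different types follows. The data and column names (id, dose, group) are hypothetical. Tibbles are not shown because they require the tibble package to be installed.

```r
# Hypothetical study data; column names are illustrative
df <- data.frame(
  id    = 1:3,
  dose  = c(2.1, 3.4, 5.0),
  group = factor(c("A", "B", "A"))
)

str(df)                          # one line per column with its type
nrow(df)                         # number of observations
ncol(df)                         # number of variables
col_classes <- sapply(df, class) # class of each column
```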
R distinguishes assignment from comparison:
Assignment: <-, = (= also names function arguments, e.g., na.rm = TRUE; prefer <- for variable assignment to avoid ambiguity)
Equality: ==
Relational: <, <=, >, >=, !=
Logical: !, &, | (element-wise; use in subsetting and ifelse())
Short-circuit logical: &&, || (single values only; use in if() control flow; since R 4.3.0, an operand of length greater than one is an error)
Membership: %in%
Operator precedence matters in complex expressions. Use explicit parentheses whenever an expression would otherwise be hard to read.
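The sketch below contrasts the element-wise and single-value logical operators and shows one precedence case where parentheses change the result. The vector x is a made-up example.

```r
x <- c(1, 5, 10)

# Element-wise operators return one result per element
in_range <- x > 2 & x < 8        # FALSE TRUE FALSE
x[in_range]                      # logical vectors can drive subsetting

# && and || expect single values, as in if() conditions
if (length(x) > 0 && all(x > 0)) {
  message("x is non-empty and strictly positive")
}

# Membership test against a set of values
is_member <- x %in% c(5, 10)     # FALSE TRUE TRUE

# Precedence: ^ binds more tightly than unary minus
neg_sq <- -2^2                   # -4
sq_neg <- (-2)^2                 # 4
```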
R distinguishes several special values:
NA: missing value
NaN: undefined numeric result (e.g., 0/0)
Inf, -Inf: positive/negative infinity
Most summary functions propagate NA unless na.rm = TRUE is set.
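A short sketch of how these special values behave in practice, using a hypothetical vector x with one missing entry:

```r
x <- c(4.2, NA, 6.1)

mean_all   <- mean(x)                 # NA: the missing value propagates
mean_valid <- mean(x, na.rm = TRUE)   # 5.15: NA removed before averaging
is.na(x)                              # FALSE TRUE FALSE

undefined <- 0 / 0                    # NaN
infinite  <- 1 / 0                    # Inf

# NaN counts as missing for is.na(), but NA is not NaN
is.na(undefined)                      # TRUE
is.nan(NA)                            # FALSE
```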
R may coerce types implicitly. This is convenient but can introduce subtle errors if not monitored.
In statistical work, type checks (str, typeof, is.factor, is.numeric) should be part of routine data validation.
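The following sketch shows two common coercion cases and a simple check after conversion. The input vector raw_values is hypothetical and stands in for data read from a file.

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "2", TRUE)
typeof(mixed)                     # "character"

# Logical values are coerced to 1/0 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2

# Text that cannot be parsed becomes NA (R also issues a warning,
# suppressed here only to keep the example quiet)
raw_values <- c("3.5", "4.1", "n/a")
converted  <- suppressWarnings(as.numeric(raw_values))

is.numeric(converted)             # TRUE
anyNA(converted)                  # TRUE: one value failed to convert
```

Checking anyNA() after a conversion is one inexpensive way to notice values that were silently lost.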
Subsetting can be positional, logical, or name-based.
For matrices/data frames, be explicit about dimensions (drop = FALSE) when a 2D structure must be preserved.
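As a sketch of the three subsetting styles and the effect of drop = FALSE, consider a hypothetical named vector scores and a small matrix m:

```r
scores <- c(a = 10, b = 20, c = 30)

scores[2]              # positional
scores[scores > 15]    # logical
scores["c"]            # name-based

m <- matrix(1:6, nrow = 2)

col_vec <- m[, 1]                 # dimensions dropped: plain vector
col_mat <- m[, 1, drop = FALSE]   # dimensions kept: 2 x 1 matrix

dim(col_vec)           # NULL
dim(col_mat)           # 2 1
```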
Functions encapsulate logic and improve reproducibility. R uses lexical scoping: a function resolves names in its local environment first, then in enclosing environments.
As a rule, functions should avoid side effects (e.g., modifying global objects) unless explicitly intended.
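A minimal sketch of lexical scoping follows. The names scale_factor, rescale, and rescale_local are hypothetical.

```r
scale_factor <- 10   # defined in the global environment

rescale <- function(x) {
  # scale_factor is not defined locally, so R looks it up
  # in the enclosing (here: global) environment
  x * scale_factor
}

rescale_local <- function(x, scale_factor = 2) {
  # the argument shadows the global object of the same name
  x * scale_factor
}

res_global <- rescale(3)        # 30
res_local  <- rescale_local(3)  # 6
scale_factor                    # still 10: neither call modified it
```

Passing values as arguments, as in rescale_local(), tends to make the dependency visible and the function easier to reuse.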
R supports if, for, while, and repeat, but vectorized solutions are often clearer and faster for statistical tasks.
The apply family (apply, lapply, sapply, vapply) is frequently used for structured iteration.
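The sketch below applies the same summary to each element of a hypothetical list (samples) and to the rows and columns of a small matrix.

```r
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

means_list <- lapply(samples, mean)              # always returns a list
means_vec  <- sapply(samples, mean)              # simplified to a named vector
means_safe <- vapply(samples, mean, numeric(1))  # result type is checked

m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # 9 12
col_sums <- apply(m, 2, sum)   # 3 7 11
```

vapply() is often preferred in scripts because it fails early if a result does not have the declared type and length.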
Many methods in R use the formula interface:
y ~ x1 + x2: additive model
y ~ x1 * x2: interaction plus main effects
y ~ .: all available predictors
y ~ x - 1: no intercept
Model outputs are structured objects; extraction of coefficients, residuals, fitted values, and diagnostics is typically done with accessor functions.
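As a sketch of the formula interface and the accessor functions, the example below fits linear models to mtcars, a data set that ships with base R. The choice of data set and variables is an assumption made purely for illustration.

```r
# Additive model, model with interaction, and model without intercept
fit_add   <- lm(mpg ~ wt + hp, data = mtcars)
fit_int   <- lm(mpg ~ wt * hp, data = mtcars)
fit_noint <- lm(mpg ~ wt - 1,  data = mtcars)

coef(fit_add)                # estimated coefficients
names(coef(fit_int))         # "(Intercept)" "wt" "hp" "wt:hp"
names(coef(fit_noint))       # "wt"

head(residuals(fit_add))     # residuals
head(fitted(fit_add))        # fitted values
summary(fit_add)$r.squared   # one element of the summary object
```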
R’s statistical ecosystem is package-based. A robust workflow includes:
Installing packages (install.packages)
Loading packages (library)
When moving from app-based analysis to coding, prioritize the following sequence:
Data structures (vector, factor, data.frame)
Missing-value handling (NA, na.rm, complete.cases)
This sequence is sufficient to reproduce most analyses presented in this handbook with R.