Table of contents

  • 160.1 Overfitting and the Need for Held-Out Data
  • 160.2 Single Holdout Split
  • 160.3 Repeated Holdout
  • 160.4 Stratified Holdout
  • 160.5 Rolling-Origin Validation for Time Series
  • 160.6 Comparison Metrics for Regression and Classification
    • 160.6.1 Regression Metrics
    • 160.6.2 Classification Metrics
  • 160.7 k-Fold Cross-Validation
  • 160.8 Practical Exercises

160  Model Validation

A model that fits its training data well is not necessarily a model that will work well on new data. Model validation is the set of procedures used to estimate how a fitted model will perform on observations it has not seen. This chapter introduces the main validation strategies used throughout this part and in the Guided Model Building app.

160.1 Overfitting and the Need for Held-Out Data

Overfitting occurs when a model captures noise in the training data rather than the underlying pattern. A model that is too flexible will fit the training data almost perfectly but generalize poorly.

The following example makes this concrete. A small synthetic dataset is fitted with polynomials of increasing degree. As the degree increases, the training fit improves. To keep the held-out pattern readable for teaching purposes, the right-hand panel averages the training and held-out \(R^2\) values across many random splits of the same dataset rather than relying on a single noisy split.

# Simulate a small dataset: a smooth quadratic signal with mild noise
set.seed(17)
n <- 50
x_full <- seq(0, 5, length.out = n)
y_full <- 2 + 0.9 * x_full - 0.12 * x_full^2 + rnorm(n, sd = 0.15)

# Pre-generate 100 random 60/40 train/test splits; the first split is used for the left panel
set.seed(123)
split_idx <- replicate(100, sample(seq_len(n), size = floor(0.6 * n)), simplify = FALSE)

train_idx <- split_idx[[1]]
x_train <- x_full[train_idx]
y_train <- y_full[train_idx]
x_test <- x_full[-train_idx]
y_test <- y_full[-train_idx]

degrees <- 1:10
r2_train <- numeric(length(degrees))
r2_test <- numeric(length(degrees))

# For each polynomial degree, average the training and held-out R^2 over all splits
for (d in degrees) {
  train_vals <- numeric(length(split_idx))
  test_vals <- numeric(length(split_idx))

  for (r in seq_along(split_idx)) {
    idx <- split_idx[[r]]
    x_train_r <- x_full[idx]
    y_train_r <- y_full[idx]
    x_test_r <- x_full[-idx]
    y_test_r <- y_full[-idx]

    fit_r <- lm(y_train_r ~ poly(x_train_r, degree = d, raw = TRUE))
    train_vals[r] <- summary(fit_r)$r.squared

    pred_test_r <- predict(fit_r, newdata = data.frame(x_train_r = x_test_r))
    ss_res <- sum((y_test_r - pred_test_r)^2)
    ss_tot <- sum((y_test_r - mean(y_test_r))^2)
    test_vals[r] <- 1 - ss_res / ss_tot
  }

  r2_train[d] <- mean(train_vals)
  r2_test[d] <- mean(test_vals)
}

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(x_train, y_train, pch = 16, col = "grey40",
     xlab = "x", ylab = "y", main = "Polynomial fits")
points(x_test, y_test, pch = 1, col = "grey40")
legend("topright", pch = c(16, 1), col = "grey40",
       legend = c("training", "held-out"), bty = "n", cex = 0.85)

x_seq <- seq(min(x_full), max(x_full), length.out = 200)
cols <- c("steelblue", "darkorange", "firebrick")
show_degrees <- c(1, 3, 10)
for (i in seq_along(show_degrees)) {
  d <- show_degrees[i]
  fit <- lm(y_train ~ poly(x_train, degree = d, raw = TRUE))
  y_seq <- predict(fit, newdata = data.frame(x_train = x_seq))
  lines(x_seq, y_seq, col = cols[i], lwd = 2)
}
legend("topleft", legend = paste("degree", show_degrees),
       col = cols, lwd = 2, bty = "n", cex = 0.85)

plot(degrees, r2_train, type = "b", pch = 16, col = "steelblue", lwd = 2,
     xlab = "Polynomial degree", ylab = expression(R^2),
     ylim = c(min(c(r2_train, r2_test), na.rm = TRUE) - 0.03, 1.05),
     main = "Mean training vs mean held-out performance")
lines(degrees, r2_test, type = "b", pch = 17, col = "firebrick", lwd = 2, lty = 2)
legend("bottomleft", legend = c("training", "held-out"),
       col = c("steelblue", "firebrick"), pch = c(16, 17),
       lty = c(1, 2), lwd = 2, bty = "n", cex = 0.85)

Overfitting illustrated with polynomial regression. Left: fitted curves at degree 1, 3, and 10 for one illustrative split. Right: mean training and mean held-out R-squared across many random splits of the same dataset.

The left panel shows that the degree-10 polynomial follows much more of the local variation than the lower-degree fits. The right panel shows that the mean held-out \(R^2\) drops once model flexibility becomes excessive, even though mean training \(R^2\) continues to rise. The point where the two lines diverge is where overfitting begins.

This is why validation requires held-out data: performance on the training set alone cannot tell you whether the model has learned the pattern or memorized the noise.

160.2 Single Holdout Split

The simplest validation method splits the data once into a training set and a test set. The model is fitted on the training set and evaluated on the test set. This is exactly what the training percentage slider in the app available in the menu Models / Manual Model Building does (see Section 159.7).

library(naivebayes)
data("Pima.tr", package = "MASS")

set.seed(42)
n <- nrow(Pima.tr)
idx <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- Pima.tr[idx, ]
test_data <- Pima.tr[-idx, ]

nb_fit <- naive_bayes(type ~ ., data = train_data)
test_x <- subset(test_data, select = -type)
preds <- predict(nb_fit, newdata = test_x)

cm <- table(Predicted = preds, Actual = test_data$type)

barplot(
  cm,
  beside = TRUE,
  col = c("steelblue", "firebrick"),
  legend.text = rownames(cm),
  args.legend = list(title = "Predicted", bty = "n"),
  xlab = "Actual class",
  ylab = "Count",
  border = NA
)

Confusion matrix from a single holdout split of Pima.tr at 80% training

The single holdout split has a limitation: its result depends on which rows happen to fall into the training set and which fall into the test set. A different random split can produce a noticeably different accuracy estimate.

160.3 Repeated Holdout

Repeated holdout addresses the instability of a single split by repeating the procedure many times. Each repetition uses a fresh random split, fits the model, and records the held-out performance. The final estimate is the average across repetitions.

set.seed(123)
n_reps <- 100
accuracies <- numeric(n_reps)

for (r in seq_len(n_reps)) {
  idx <- sample(seq_len(n), size = floor(0.8 * n))
  train_r <- Pima.tr[idx, ]
  test_r <- Pima.tr[-idx, ]

  nb_r <- naive_bayes(type ~ ., data = train_r)
  test_r_x <- subset(test_r, select = -type)
  pred_r <- predict(nb_r, newdata = test_r_x)
  accuracies[r] <- mean(pred_r == test_r$type)
}

hist(
  accuracies,
  breaks = 15,
  col = "steelblue",
  border = "white",
  main = "",
  xlab = "Held-out accuracy",
  ylab = "Frequency"
)
abline(v = mean(accuracies), col = "firebrick", lwd = 2, lty = 2)
text(mean(accuracies), par("usr")[4] * 0.9,
     labels = paste("mean =", round(mean(accuracies), 3)),
     pos = 4, col = "firebrick", cex = 0.9)

Distribution of held-out accuracy across 100 repeated holdout splits

The histogram shows that individual splits produce accuracy values that scatter over a noticeable range. The dashed line marks the average. This average is a more reliable estimate than any single split because it absorbs the randomness of the split itself.

The Guided Model Building app uses repeated holdout as its default validation strategy for tabular data (see Section 163.4.4 and Section 163.5.2).

160.4 Stratified Holdout

Random holdout splits do not guarantee that the class proportions in the training and test sets match the population proportions. When the target is unbalanced — as in the Pima dataset, where about two thirds of cases are negative — a random split can occasionally produce a test set with a substantially different class distribution.

Stratified holdout solves this by splitting within each class separately, so the proportions are preserved by construction.

library(caret)
set.seed(77)
n_splits <- 20
true_share <- mean(Pima.tr$type == "Yes")

random_shares <- numeric(n_splits)
stratified_shares <- numeric(n_splits)

for (s in seq_len(n_splits)) {
  random_idx <- sample(seq_len(n), size = floor(0.8 * n))
  random_test <- Pima.tr[-random_idx, ]
  random_shares[s] <- mean(random_test$type == "Yes")

  # createDataPartition() samples within each class, so the class proportions are preserved
  strat_idx <- createDataPartition(Pima.tr$type, p = 0.8, list = FALSE)
  strat_test <- Pima.tr[-strat_idx, ]
  stratified_shares[s] <- mean(strat_test$type == "Yes")
}

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

barplot(random_shares, col = "steelblue", border = NA,
        ylim = c(0, 0.6), ylab = "Positive-class share",
        main = "Random splits", xlab = "Split")
abline(h = true_share, col = "firebrick", lwd = 2, lty = 2)

barplot(stratified_shares, col = "steelblue", border = NA,
        ylim = c(0, 0.6), ylab = "Positive-class share",
        main = "Stratified splits", xlab = "Split")
abline(h = true_share, col = "firebrick", lwd = 2, lty = 2)

Positive-class share across 20 random splits (left) versus 20 stratified splits (right). The dashed line marks the true share.

The random splits show visible wobble in the positive-class share. The stratified splits cluster tightly around the true proportion (dashed line). When the dataset is small or the class balance is uneven, stratification prevents accidental evaluation on an unrepresentative test set.

160.5 Rolling-Origin Validation for Time Series

When the data are ordered in time, random holdout splits are not appropriate because they ignore temporal dependence. A model should not be evaluated on observations that precede the training period.

Rolling-origin validation respects the time ordering. The procedure is:

  1. train the model on all observations up to a cutoff point,
  2. forecast the next \(h\) observations,
  3. record the forecast error,
  4. advance the cutoff by one period and repeat.

The error metrics are averaged across all origins to produce a stable summary; a compact sketch of this averaging loop is given after the figure below.

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

nottem_vals <- as.numeric(nottem)
nottem_time <- time(nottem)
n_obs <- length(nottem_vals)

# Two illustrative cutoff points (months 180 and 204) and a 12-month forecast horizon
origin_1 <- 180
origin_2 <- 204
h <- 12

plot(nottem_time, nottem_vals, type = "l", col = "grey40",
     xlab = "Year", ylab = "Temperature",
     main = "Two rolling-origin windows")

rect(nottem_time[1], par("usr")[3],
     nottem_time[origin_1], par("usr")[4],
     col = rgb(0.27, 0.51, 0.71, 0.15), border = NA)
rect(nottem_time[origin_1 + 1], par("usr")[3],
     nottem_time[min(origin_1 + h, n_obs)], par("usr")[4],
     col = rgb(0.80, 0.20, 0.20, 0.15), border = NA)

rect(nottem_time[1], par("usr")[3],
     nottem_time[origin_2], par("usr")[4],
     col = rgb(0.27, 0.51, 0.71, 0.10), border = NA)
rect(nottem_time[origin_2 + 1], par("usr")[3],
     nottem_time[min(origin_2 + h, n_obs)], par("usr")[4],
     col = rgb(0.80, 0.20, 0.20, 0.10), border = NA)

legend("topright", fill = c(rgb(0.27, 0.51, 0.71, 0.3), rgb(0.80, 0.20, 0.20, 0.3)),
       legend = c("training window", "test window"), bty = "n", cex = 0.85)

# Fit an ARIMA model to all observations up to the second origin and forecast h steps ahead
train_ts <- ts(nottem_vals[1:origin_2], frequency = 12)
arima_fit <- forecast::auto.arima(train_ts)
fc <- forecast::forecast(arima_fit, h = h)

actual_vals <- nottem_vals[(origin_2 + 1):min(origin_2 + h, n_obs)]
fc_time <- nottem_time[(origin_2 + 1):min(origin_2 + h, n_obs)]

plot(fc_time, actual_vals, type = "b", pch = 16, col = "grey40",
     xlab = "Year", ylab = "Temperature",
     ylim = range(c(actual_vals, as.numeric(fc$mean)), na.rm = TRUE),
     main = "Forecast versus actual (second origin)")
lines(fc_time, as.numeric(fc$mean)[seq_along(fc_time)],
      col = "steelblue", lwd = 2)
points(fc_time, as.numeric(fc$mean)[seq_along(fc_time)],
       pch = 17, col = "steelblue")
legend("topright", pch = c(16, 17), col = c("grey40", "steelblue"),
       legend = c("actual", "forecast"), bty = "n", cex = 0.85)

Rolling-origin validation on the nottem series. Top: two training windows (blue) and their corresponding test windows (red). Bottom: forecast versus actual for the second origin.

The top panel shows how the training window grows as the origin advances. The bottom panel shows the forecast produced at the second origin against the actual values. The Guided Model Building app uses rolling-origin validation for all time-series models (see Section 164.4.1 and Chapter 153).
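
The figure shows only two origins. As a minimal sketch (not the implementation used by the app), the full rolling-origin loop can be written as follows: forecast::auto.arima is refitted at every cutoff and the forecast RMSE is averaged across origins. The yearly spacing of the origins and the ten-year burn-in are illustrative choices.

# Rolling-origin evaluation of auto.arima on the nottem series:
# train on all data up to each cutoff, forecast h steps ahead,
# record the RMSE of that forecast, then average across origins.
origins <- seq(120, n_obs - h, by = 12)   # one origin per year after a 10-year burn-in
rmse_per_origin <- numeric(length(origins))

for (i in seq_along(origins)) {
  cutoff <- origins[i]
  train_i <- ts(nottem_vals[1:cutoff], frequency = 12)
  fit_i <- forecast::auto.arima(train_i)
  fc_i <- forecast::forecast(fit_i, h = h)
  actual_i <- nottem_vals[(cutoff + 1):(cutoff + h)]
  rmse_per_origin[i] <- sqrt(mean((actual_i - as.numeric(fc_i$mean))^2))
}

mean(rmse_per_origin)   # averaged rolling-origin RMSE

Refitting at every origin costs more computation than a single split, but the averaged error no longer depends on one particular test window.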

160.6 Comparison Metrics for Regression and Classification

Validation produces held-out predictions. To compare models, those predictions must be summarized into metrics. The appropriate metrics depend on whether the target is continuous (regression) or categorical (classification).

160.6.1 Regression Metrics

Metric, definition, and interpretation:

RMSE: \(\sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2}\). Average error magnitude; penalizes large errors more heavily.
MAE: \(\frac{1}{n}\sum |y_i - \hat{y}_i|\). Average absolute error; less sensitive to outliers.
\(R^2\): \(1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\). Proportion of variance explained (see Chapter 135).

library(MASS)

set.seed(55)
idx <- sample(seq_len(nrow(Cars93)), size = floor(0.8 * nrow(Cars93)))
cars_train <- Cars93[idx, ]
cars_test <- Cars93[-idx, ]

fit <- lm(Price ~ Horsepower + Fuel.tank.capacity + EngineSize, data = cars_train)
pred <- predict(fit, newdata = cars_test)
actual <- cars_test$Price

plot(actual, pred, pch = 16, col = "steelblue",
     xlab = "Actual price", ylab = "Predicted price",
     xlim = range(c(actual, pred)), ylim = range(c(actual, pred)))
abline(0, 1, col = "grey40", lwd = 1, lty = 2)

residuals <- abs(actual - pred)
top3 <- order(residuals, decreasing = TRUE)[1:3]
segments(actual[top3], pred[top3], actual[top3], actual[top3],
         col = "firebrick", lwd = 2)
points(actual[top3], pred[top3], pch = 16, col = "firebrick")

rmse <- sqrt(mean((actual - pred)^2))
mae <- mean(abs(actual - pred))
ss_res <- sum((actual - pred)^2)
ss_tot <- sum((actual - mean(actual))^2)
r2 <- 1 - ss_res / ss_tot

legend("topleft",
       legend = c(
         paste("RMSE =", round(rmse, 2)),
         paste("MAE =", round(mae, 2)),
         paste("R\u00b2 =", round(r2, 3))
       ),
       bty = "n", cex = 0.9)

Predicted versus actual price for Cars93. Segments mark the three largest residuals.

The three red segments connect the largest prediction errors to the diagonal. These observations contribute most to the RMSE: because RMSE squares the errors, the reported performance is driven disproportionately by the model’s worst predictions, whereas the MAE treats all errors linearly.

160.6.2 Classification Metrics

Metric, source, and interpretation:

Accuracy: confusion matrix. Share of all observations correctly classified.
AUC: ROC curve (Chapter 60). Overall discriminative ability across all thresholds.
Sensitivity: confusion matrix (Chapter 59). Share of actual positives correctly identified.
Specificity: confusion matrix (Chapter 59). Share of actual negatives correctly identified.
F1: harmonic mean of precision and sensitivity. Balances precision and recall.
MCC: Matthews correlation coefficient. Balanced measure even with unequal class sizes.

F1 is the harmonic mean of precision and sensitivity. Precision is the share of predicted positives that are truly positive. The harmonic mean ensures that F1 is low whenever either precision or sensitivity is low.

MCC (Matthews correlation coefficient) is a correlation coefficient between the observed and predicted binary classifications. It takes values between \(-1\) and \(+1\), where \(+1\) indicates perfect prediction, \(0\) indicates no better than random, and \(-1\) indicates total disagreement. Unlike accuracy, MCC remains informative when the classes are unbalanced.
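
All of these metrics except AUC can be computed directly from a confusion matrix. As a minimal sketch, the code below derives them from the matrix cm obtained in Section 160.2, assuming the objects from that example are still in the workspace and treating "Yes" as the positive class; AUC additionally requires predicted class probabilities (see Chapter 60).

# cm has predicted classes in the rows and actual classes in the columns (Section 160.2)
TP <- as.numeric(cm["Yes", "Yes"])   # true positives
TN <- as.numeric(cm["No",  "No"])    # true negatives
FP <- as.numeric(cm["Yes", "No"])    # false positives
FN <- as.numeric(cm["No",  "Yes"])   # false negatives

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)        # recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

# F1: harmonic mean of precision and sensitivity
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# MCC: correlation between observed and predicted class labels
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        precision = precision, F1 = f1, MCC = mcc), 3)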

160.7 k-Fold Cross-Validation

In k-fold cross-validation, the data are divided into \(k\) equally sized groups (folds). The model is trained on \(k - 1\) folds and evaluated on the remaining fold. This is repeated \(k\) times so that every observation serves as a test case exactly once. The final metric is the average across folds.

par(mar = c(2, 6, 2, 1))
k <- 5
plot(NULL, xlim = c(0.5, k + 0.5), ylim = c(0.5, k + 0.5),
     xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n",
     main = "")

for (i in 1:k) {
  for (j in 1:k) {
    col <- if (j == i) "firebrick" else "steelblue"
    alpha <- if (j == i) 0.7 else 0.3
    rect(j - 0.4, k - i + 1 - 0.35, j + 0.4, k - i + 1 + 0.35,
         col = adjustcolor(col, alpha.f = alpha), border = "white", lwd = 2)
    label <- if (j == i) "test" else "train"
    text(j, k - i + 1, label, cex = 0.7, col = "white", font = 2)
  }
  mtext(paste("Iteration", i), side = 2, at = k - i + 1, las = 1, cex = 0.8, line = 0.5)
}
mtext(paste("Fold", 1:k), side = 3, at = 1:k, cex = 0.8)

Schematic of 5-fold cross-validation. Each row is one iteration. The shaded block is the test fold.

This handbook uses repeated holdout rather than k-fold cross-validation as the primary validation method. The reason is practical: repeated holdout maps directly to the training percentage slider in the app available in the menu Models / Manual Model Building and to the repeated-holdout validation used by the Guided Model Building app. The two methods address the same fundamental problem — estimating held-out performance — but repeated holdout is easier to connect to the tools used throughout the book.

k-fold cross-validation is widely used in practice and has the advantage that every observation appears in the test set exactly once. It is especially useful when the dataset is small and wasting observations on a large held-out set is costly.
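
As a minimal sketch, 5-fold cross-validation for the naive Bayes classifier of Section 160.2 could look like this, assuming Pima.tr and the naivebayes package are still loaded:

set.seed(99)
k_folds <- 5
n <- nrow(Pima.tr)
# Randomly assign each observation to exactly one fold
fold_id <- sample(rep(1:k_folds, length.out = n))

fold_acc <- numeric(k_folds)
for (f in 1:k_folds) {
  train_f <- Pima.tr[fold_id != f, ]
  test_f  <- Pima.tr[fold_id == f, ]

  nb_f <- naive_bayes(type ~ ., data = train_f)
  pred_f <- predict(nb_f, newdata = subset(test_f, select = -type))
  fold_acc[f] <- mean(pred_f == test_f$type)
}

fold_acc          # one held-out accuracy per fold
mean(fold_acc)    # cross-validated accuracy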

160.8 Practical Exercises

  1. Run the overfitting example above with a different set.seed value. Does the crossover point between training and held-out \(R^2\) always occur at the same polynomial degree?
  2. Change the number of repetitions in the repeated holdout example from 100 to 10. How does the histogram change? How does the mean change?
  3. Modify the stratified holdout example to use a 60/40 split instead of 80/20. Does the difference between random and stratified splits become larger or smaller?
  4. In the rolling-origin example, change the forecast horizon from 12 to 24. How does this affect the forecast accuracy?
  5. Compute the RMSE and MAE for the Cars93 regression example. Which metric is more affected by the three largest residuals?
  6. Explain in your own words why k-fold cross-validation guarantees that every observation is tested exactly once, whereas repeated holdout does not.
Cookie Preferences