• Descriptive
    • Moments
    • Concentration
    • Central Tendency
    • Variability
    • Stem-and-Leaf Plot
    • Histogram & Frequency Table
    • Data Quality Forensics
    • Conditional EDA
    • Quantiles
    • Kernel Density Estimation
    • Normal QQ Plot
    • Bootstrap Plot

    • Multivariate Descriptive Statistics
  • Distributions
    • Binomial Probabilities
    • Geometric Probabilities
    • Negative Binomial Probabilities
    • Hypergeometric Probabilities
    • Multinomial Probabilities
    • Dirichlet
    • Poisson Probabilities

    • Exponential
    • Gamma
    • Erlang
    • Weibull
    • Rayleigh
    • Maxwell-Boltzmann
    • Lognormal
    • Pareto
    • Inverse Gamma
    • Inverse Chi-Square

    • Beta
    • Power
    • Beta Prime (Inv. Beta)
    • Triangular

    • Normal (area)
    • Logistic
    • Laplace
    • Cauchy (standard)
    • Cauchy (location-scale)
    • Gumbel
    • Fréchet
    • Generalized Extreme Value

    • Normal RNG
    • ML Fitting
    • Tukey Lambda PPCC
    • Box-Cox Normality Plot
    • Noncentral t
    • Noncentral F
    • Sample Correlation r

    • Empirical Tests
  • Hypotheses
    • Theoretical Aspects of Hypothesis Testing
    • Bayesian Inference
    • Minimum Sample Size

    • Empirical Tests
    • Multivariate (pair-wise) Testing
  • Models
    • Manual Model Building
    • Guided Model Building
  • Time Series
    • Time Series Plot
    • Decomposition
    • Exponential Smoothing

    • Blocked Bootstrap Plot
    • Mean Plot
    • (P)ACF
    • VRM
    • Standard Deviation-Mean Plot
    • Spectral Analysis
    • ARIMA

    • Cross Correlation Function
    • Granger Causality
  1. Model Building Strategies
  2. 161  Regularization Methods
  • Preface
  • Getting Started
    • 1  Introduction
    • 2  Why Do We Need Innovative Technology?
    • 3  Basic Definitions
    • 4  The Big Picture: Why We Analyze Data
  • Introduction to Probability
    • 5  Definitions of Probability
    • 6  Jeffreys’ axiom system
    • 7  Bayes’ Theorem
    • 8  Sensitivity and Specificity
    • 9  Naive Bayes Classifier
    • 10  Law of Large Numbers

    • 11  Problems
  • Probability Distributions
    • 12  Bernoulli Distribution
    • 13  Binomial Distribution
    • 14  Geometric Distribution
    • 15  Negative Binomial Distribution
    • 16  Hypergeometric Distribution
    • 17  Multinomial Distribution
    • 18  Poisson Distribution

    • 19  Uniform Distribution (Rectangular Distribution)
    • 20  Normal Distribution (Gaussian Distribution)
    • 21  Gaussian Naive Bayes Classifier
    • 22  Chi Distribution
    • 23  Chi-squared Distribution (1 parameter)
    • 24  Chi-squared Distribution (2 parameters)
    • 25  Student t-Distribution
    • 26  Fisher F-Distribution
    • 27  Exponential Distribution
    • 28  Lognormal Distribution
    • 29  Gamma Distribution
    • 30  Beta Distribution
    • 31  Weibull Distribution
    • 32  Pareto Distribution
    • 33  Inverse Gamma Distribution
    • 34  Rayleigh Distribution
    • 35  Erlang Distribution
    • 36  Logistic Distribution
    • 37  Laplace Distribution
    • 38  Gumbel Distribution
    • 39  Cauchy Distribution
    • 40  Triangular Distribution
    • 41  Power Distribution
    • 42  Beta Prime Distribution
    • 43  Sample Correlation Distribution
    • 44  Dirichlet Distribution
    • 45  Generalized Extreme Value (GEV) Distribution
    • 46  Frechet Distribution
    • 47  Noncentral t Distribution
    • 48  Noncentral F Distribution
    • 49  Inverse Chi-Squared Distribution
    • 50  Maxwell-Boltzmann Distribution
    • 51  Distribution Relationship Map

    • 52  Problems
  • Descriptive Statistics & Exploratory Data Analysis
    • 53  Types of Data
    • 54  Datasheets

    • 55  Frequency Plot (Bar Plot)
    • 56  Frequency Table
    • 57  Contingency Table
    • 58  Binomial Classification Metrics
    • 59  Confusion Matrix
    • 60  ROC Analysis

    • 61  Stem-and-Leaf Plot
    • 62  Histogram
    • 63  Data Quality Forensics
    • 64  Quantiles
    • 65  Central Tendency
    • 66  Variability
    • 67  Skewness & Kurtosis
    • 68  Concentration
    • 69  Notched Boxplot
    • 70  Scatterplot
    • 71  Pearson Correlation
    • 72  Rank Correlation
    • 73  Partial Pearson Correlation
    • 74  Simple Linear Regression
    • 75  Moments
    • 76  Quantile-Quantile Plot (QQ Plot)
    • 77  Normal Probability Plot
    • 78  Probability Plot Correlation Coefficient Plot (PPCC Plot)
    • 79  Box-Cox Normality Plot
    • 80  Kernel Density Estimation
    • 81  Bivariate Kernel Density Plot
    • 82  Conditional EDA: Panel Diagnostics
    • 83  Bootstrap Plot (Central Tendency)
    • 84  Survey Scores Rank Order Comparison
    • 85  Cronbach Alpha

    • 86  Equi-distant Time Series
    • 87  Time Series Plot (Run Sequence Plot)
    • 88  Mean Plot
    • 89  Blocked Bootstrap Plot (Central Tendency)
    • 90  Standard Deviation-Mean Plot
    • 91  Variance Reduction Matrix
    • 92  (Partial) Autocorrelation Function
    • 93  Periodogram & Cumulative Periodogram

    • 94  Problems
  • Hypothesis Testing
    • 95  Normal Distributions revisited
    • 96  The Population
    • 97  The Sample
    • 98  The One-Sided Hypothesis Test
    • 99  The Two-Sided Hypothesis Test
    • 100  When to use a one-sided or two-sided test?
    • 101  What if \(\sigma\) is unknown?
    • 102  The Central Limit Theorem (revisited)
    • 103  Statistical Test of the Population Mean with known Variance
    • 104  Statistical Test of the Population Mean with unknown Variance
    • 105  Statistical Test of the Variance
    • 106  Statistical Test of the Population Proportion
    • 107  Statistical Test of the Standard Deviation \(\sigma\)
    • 108  Statistical Test of the difference between Means -- Independent/Unpaired Samples
    • 109  Statistical Test of the difference between Means -- Dependent/Paired Samples
    • 110  Statistical Test of the difference between Variances -- Independent/Unpaired Samples

    • 111  Hypothesis Testing for Research Purposes
    • 112  Decision Thresholds, Alpha, and Confidence Levels
    • 113  Bayesian Inference for Decision-Making
    • 114  One Sample t-Test
    • 115  Skewness & Kurtosis Tests
    • 116  Paired Two Sample t-Test
    • 117  Wilcoxon Signed-Rank Test
    • 118  Unpaired Two Sample t-Test
    • 119  Unpaired Two Sample Welch Test
    • 120  Two One-Sided Tests (TOST) for Equivalence
    • 121  Mann-Whitney U test (Wilcoxon Rank-Sum Test)
    • 122  Bayesian Two Sample Test
    • 123  Median Test based on Notched Boxplots
    • 124  Chi-Squared Tests for Count Data
    • 125  Kolmogorov-Smirnov Test
    • 126  One Way Analysis of Variance (1-way ANOVA)
    • 127  Kruskal-Wallis Test
    • 128  Two Way Analysis of Variance (2-way ANOVA)
    • 129  Repeated Measures ANOVA
    • 130  Friedman Test
    • 131  Testing Correlations
    • 132  A Note on Causality

    • 133  Problems
  • Regression Models
    • 134  Simple Linear Regression Model (SLRM)
    • 135  Multiple Linear Regression Model (MLRM)
    • 136  Logistic Regression
    • 137  Generalized Linear Models
    • 138  Multinomial and Ordinal Logistic Regression
    • 139  Cox Proportional Hazards Regression
    • 140  Conditional Inference Trees
    • 141  Leaf Diagnostics for Conditional Inference Trees
    • 142  Conditional Random Forests
    • 143  Hypothesis Testing with Linear Regression Models (from a Practical Point of View)

    • 144  Problems
  • Introduction to Time Series Analysis
    • 145  Case: the Market of Health and Personal Care Products
    • 146  Decomposition of Time Series
    • 147  Ad hoc Forecasting of Time Series
  • Box-Jenkins Analysis
    • 148  Introduction to Box-Jenkins Analysis
    • 149  Theoretical Concepts
    • 150  Stationarity
    • 151  Identifying ARMA parameters
    • 152  Estimating ARMA Parameters and Residual Diagnostics
    • 153  Forecasting with ARIMA models
    • 154  Intervention Analysis
    • 155  Cross-Correlation Function
    • 156  Transfer Function Noise Models
    • 157  General-to-Specific Modeling
  • Model Building Strategies
    • 158  Introduction to Model Building Strategies
    • 159  Manual Model Building
    • 160  Model Validation
    • 161  Regularization Methods
    • 162  Hyperparameter Optimization Strategies
    • 163  Guided Model Building in Practice
    • 164  Diagnostics, Revision, and Guided Forecasting
    • 165  Leakage, Target Encoding, and Robust Regression
  • References
  • Appendices
    • Appendices
    • A  Method Selection Guide
    • B  Presentations and Teaching Materials
    • C  R Language Concepts for Statistical Computing
    • D  Matrix Algebra
    • E  Standard Normal Table (Gaussian Table)
    • F  Critical values of Student’s \(t\) distribution with \(\nu\) degrees of freedom
    • G  Upper-tail critical values of the \(\chi^2\)-distribution with \(\nu\) degrees of freedom
    • H  Lower-tail critical values of the \(\chi^2\)-distribution with \(\nu\) degrees of freedom

Table of contents

  • 161.1 Why Coefficients Need Shrinkage
  • 161.2 The Three Standard Penalties
  • 161.3 A Small Regression Example with mtcars
    • 161.3.1 Coefficient Paths
    • 161.3.2 Choosing lambda by Cross-Validation
    • 161.3.3 What Shrinkage Looks Like Numerically
  • 161.4 The Same Logic in Logistic Regression
  • 161.5 Try Regularization in the Apps
    • 161.5.1 Manual Model Building: Explicit Search on the GLM Tab
    • 161.5.2 Guided Model Building: Regularized Candidates Inside a Workflow
  • 161.6 Practical Reading Rule
  • 161.7 Practical Exercises
  1. Model Building Strategies
  2. 161  Regularization Methods

161  Regularization Methods

Regularization is used when a regression or classification model is flexible enough to fit the training data too aggressively. The idea is simple: do not let the coefficients move as freely as ordinary fitting would allow.

Instead of minimizing only the residual sum of squares or the negative log-likelihood, a regularized method adds a penalty for large coefficients. The penalty discourages unstable fits and usually improves out-of-sample performance when predictors are numerous, correlated, or both.

This chapter uses glmnet because it provides the three standard penalties in one framework:

Method Penalty idea Practical effect
Ridge shrink all coefficients toward zero keeps every predictor, reduces instability
Lasso shrink coefficients and allow some to become exactly zero combines shrinkage with variable selection
Elastic net blend ridge and lasso useful when predictors are both numerous and correlated

161.1 Why Coefficients Need Shrinkage

When predictors are strongly related to one another, ordinary least squares and ordinary logistic regression can produce coefficients that move around a lot from one sample to the next. The fitted values may still look reasonable, but the individual coefficients become harder to trust.

Regularization does not eliminate the need for good data, sensible predictors, or leakage protection. It does something narrower:

  • it reduces coefficient volatility,
  • it lowers the chance that the model chases noise,
  • it often improves predictive performance on new data.

This is why regularization belongs more naturally to a predictive workflow than to a confirmatory one (see Section 158.2). A lasso model that sets some coefficients to zero can be useful, but it should not be mistaken for a substantive scientific argument about causation.

161.2 The Three Standard Penalties

For linear regression, the ordinary least squares objective is

\[ \sum_{i=1}^{n}(y_i - \hat y_i)^2. \]

Regularization adds a penalty term:

\[ \sum_{i=1}^{n}(y_i - \hat y_i)^2 + \lambda \cdot \text{Penalty}(\beta). \]

The tuning constant \(\lambda\) controls how strongly coefficients are shrunk.

  • Ridge regression uses \(\sum_j \beta_j^2\).
  • Lasso regression uses \(\sum_j |\beta_j|\).
  • Elastic net uses a weighted blend of the two.

The second tuning constant is therefore alpha:

  • alpha = 0 gives ridge,
  • alpha = 1 gives lasso,
  • values between 0 and 1 give elastic net.

The key practical point is that the data do not estimate lambda for you automatically. It must be chosen by validation, which is why this chapter connects directly to Chapter 160 and Chapter 162.

161.3 A Small Regression Example with mtcars

The mtcars dataset is small enough that ordinary fitting can be sensitive to the exact training sample. That makes it a convenient teaching example for shrinkage.

library(glmnet)

data(mtcars)

x <- model.matrix(mpg ~ . - 1, data = mtcars)
y <- mtcars$mpg

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

set.seed(42)
idx <- sample(seq_len(nrow(mtcars)), size = 24)

x_train <- x[idx, ]
x_test <- x[-idx, ]
y_train <- y[idx]
y_test <- y[-idx]

This split does two jobs at once:

  • the training set is used to fit the regularized models,
  • cross-validation inside the training set is used to choose lambda,
  • the outer test set is held back for one final comparison.

That separation matters. If the same rows were used both to choose lambda and to report final performance, the comparison would be optimistic.

161.3.1 Coefficient Paths

The next figure shows what happens when lambda changes from weak shrinkage to strong shrinkage.

ridge_fit <- glmnet(x_train, y_train, alpha = 0)
lasso_fit <- glmnet(x_train, y_train, alpha = 1)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(ridge_fit, xvar = "lambda", label = TRUE)
title("Ridge coefficient paths")

plot(lasso_fit, xvar = "lambda", label = TRUE)
title("Lasso coefficient paths")

Ridge and lasso coefficient paths. Moving toward stronger penalties shrinks coefficients; lasso can drive some of them exactly to zero.

The visual difference is the main lesson:

  • ridge shrinks coefficients continuously,
  • lasso shrinks them too, but some paths hit exactly zero,
  • elastic net sits between those two behaviors.

So the question is not only “Which fit is most accurate?” It is also “How much simplification do we want?”

161.3.2 Choosing lambda by Cross-Validation

cv.glmnet() searches over many lambda values and picks the one that performs best under cross-validation. It also reports a more conservative alternative: the one-standard-error rule.

  • lambda.min is the value with the lowest cross-validated error,
  • lambda.1se is the largest value whose error is still within one standard error of the minimum.

In practice, lambda.1se often gives a slightly simpler and more stable model.

ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 5)
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
enet_cv <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 5)

summarise_fit <- function(name, cvfit) {
  co <- as.matrix(coef(cvfit, s = "lambda.1se"))
  nz <- sum(abs(co[-1, 1]) > 0)
  pred <- as.numeric(predict(cvfit, newx = x_test, s = "lambda.1se"))
  data.frame(
    Model = name,
    Lambda1SE = cvfit$lambda.1se,
    TestRMSE = rmse(y_test, pred),
    ActiveCoefficients = nz
  )
}

reg_compare <- rbind(
  summarise_fit("Ridge", ridge_cv),
  summarise_fit("Lasso", lasso_cv),
  summarise_fit("Elastic net", enet_cv)
)

knitr::kable(
  transform(
    reg_compare,
    Lambda1SE = round(Lambda1SE, 3),
    TestRMSE = round(TestRMSE, 3)
  ),
  caption = "Outer-test comparison after choosing lambda by inner cross-validation"
)
Outer-test comparison after choosing lambda by inner cross-validation
Model Lambda1SE TestRMSE ActiveCoefficients
Ridge 10.392 4.582 10
Lasso 1.439 4.860 3
Elastic net 2.177 4.805 7

In this run:

  • ridge gives the lowest outer-test RMSE,
  • lasso gives the sparsest model,
  • elastic net sits between them on both complexity and error.

That is the regularization tradeoff in one table: accuracy, simplicity, and stability do not always point to the same choice.

161.3.3 What Shrinkage Looks Like Numerically

ridge_coef <- round(as.matrix(coef(ridge_cv, s = "lambda.1se")), 3)
lasso_coef <- round(as.matrix(coef(lasso_cv, s = "lambda.1se")), 3)

coef_compare <- data.frame(
  Predictor = rownames(ridge_coef),
  Ridge = ridge_coef[, 1],
  Lasso = lasso_coef[, 1]
)

knitr::kable(
  coef_compare,
  caption = "Ridge versus lasso coefficients at lambda.1se"
)
Ridge versus lasso coefficients at lambda.1se
Predictor Ridge Lasso
(Intercept) (Intercept) 18.680 30.261
cyl cyl -0.272 -0.465
disp disp -0.004 0.000
hp hp -0.008 -0.011
drat drat 0.843 0.000
wt wt -0.691 -1.945
qsec qsec 0.138 0.000
vs vs 0.574 0.000
am am 0.964 0.000
gear gear 0.481 0.000
carb carb -0.346 0.000

The ridge model keeps every predictor in the model, but many are small. The lasso model keeps only a few predictors away from zero. That does not mean the dropped variables are scientifically irrelevant. It means the lasso does not need them to optimize this predictive fit at this penalty level.

161.4 The Same Logic in Logistic Regression

Regularization is not limited to linear regression. The same glmnet framework works for binomial outcomes.

library(MASS)
data("Pima.tr", package = "MASS")

x_pima <- model.matrix(type ~ . - 1, data = Pima.tr)
y_pima <- Pima.tr$type

set.seed(123)
lasso_logit <- cv.glmnet(
  x_pima, y_pima,
  family = "binomial",
  alpha = 1,
  nfolds = 5
)

lasso_logit_coef <- round(as.matrix(coef(lasso_logit, s = "lambda.1se")), 3)
lasso_logit_coef <- data.frame(
  Predictor = rownames(lasso_logit_coef),
  Coefficient = lasso_logit_coef[, 1]
)

knitr::kable(
  subset(lasso_logit_coef, Coefficient != 0),
  caption = "Nonzero coefficients from a lasso-logistic fit on Pima.tr"
)
Nonzero coefficients from a lasso-logistic fit on Pima.tr
Predictor Coefficient
(Intercept) (Intercept) -5.498
npreg npreg 0.024
glu glu 0.021
bmi bmi 0.030
ped ped 0.509
age age 0.025

Here the lasso keeps only a subset of the predictors. The practical reading rule is the same as before:

  • this is useful for prediction,
  • it can simplify the model,
  • but it should not be interpreted as a proof that the dropped variables “do not matter” in the underlying phenomenon.

161.5 Try Regularization in the Apps

The apps now expose regularization in two different styles:

  • the app in the menu Models / Manual Model Building lets you turn regularization on explicitly in the GLM and Regression tabs,
  • the Guided Model Building app includes regularized logistic and regularized linear regression as candidate models and reports the selected tuning results in the fitted model details.

161.5.1 Manual Model Building: Explicit Search on the GLM Tab

In the manual app, choose GLM, switch Regularization / hyperparameter search away from None, and then click Fit regularized GLM. The app runs the fit asynchronously and shows a wait/progress message if another heavy search is already running.

Interactive Shiny app (click to load).
Open in new tab

This view is useful because it makes the tuning decision visible. You can switch between ridge, lasso, elastic net, or an automatic penalty-family search and then inspect which coefficients remain active. The same regularization control also appears in the Regression tab for continuous outcomes.

161.5.2 Guided Model Building: Regularized Candidates Inside a Workflow

The guided app treats regularization differently. Instead of asking the learner to launch a large free-form search, it includes regularized coefficient models directly in the candidate set and tunes the penalty family inside the training rows only.

WarningFull-screen use

The Guided Model Building app is still much easier to read in a new tab, but the embedded panel below loads the regularized Pima session directly if you want to inspect the workflow from inside the chapter first.

Interactive Guided Model Building session (click to load).
Open in new tab

In that session, the most important places to look are:

  • Models, where the regularized logistic fit lists the selected penalty family, alpha, lambda.min, lambda.1se, and the nonzero coefficients,
  • Diagnostics, where you can compare its repeated-validation behavior with the other classifiers,
  • Export, where the R script reconstructs the glmnet path transparently.

The manual and guided apps therefore teach two complementary regularization habits:

  • manual workflow: choose and launch the search yourself,
  • guided workflow: compare a tuned shrinkage path against other candidate models inside a controlled validation workflow.

161.6 Practical Reading Rule

Use regularization when the goal is prediction and the coefficient pattern looks too unstable to trust unpenalized fitting.

  • Prefer ridge when you want shrinkage but do not want to drop predictors.
  • Prefer lasso when you also want automatic sparsity.
  • Prefer elastic net when predictors are correlated and pure lasso feels too aggressive.
  • Prefer the one-standard-error rule when a slightly simpler model performs essentially as well as the apparent optimum.

161.7 Practical Exercises

  1. Refit the mtcars example with a different random split. Does the identity of the lowest-RMSE method stay the same?
  2. Compare lambda.min and lambda.1se for the lasso fit. How many active coefficients do you gain or lose?
  3. Change the elastic-net mixing parameter from 0.5 to 0.2 and then to 0.8. How does the coefficient pattern move toward ridge or lasso?
  4. In the Pima.tr logistic example, compare the nonzero coefficient set at lambda.min and lambda.1se. Which rule gives the simpler classifier?
160  Model Validation
162  Hyperparameter Optimization Strategies

© 2026 Patrick Wessa. Provided as-is, without warranty.

Feedback: e-mail | Anonymous contributions: click to copy (Sats) | click to copy (XMR)

Cookie Preferences