161 Regularization Methods

Regularization is used when a regression or classification model is flexible enough to fit the training data too aggressively. The idea is simple: do not let the coefficients move as freely as ordinary fitting would allow.

Instead of minimizing only the residual sum of squares or the negative log-likelihood, a regularized method adds a penalty for large coefficients. The penalty discourages unstable fits and usually improves out-of-sample performance when predictors are numerous, correlated, or both.

This chapter uses glmnet because it provides the three standard penalties in one framework:

Method	Penalty idea	Practical effect
Ridge	shrink all coefficients toward zero	keeps every predictor, reduces instability
Lasso	shrink coefficients and allow some to become exactly zero	combines shrinkage with variable selection
Elastic net	blend ridge and lasso	useful when predictors are both numerous and correlated

161.1 Why Coefficients Need Shrinkage

When predictors are strongly related to one another, ordinary least squares and ordinary logistic regression can produce coefficients that move around a lot from one sample to the next. The fitted values may still look reasonable, but the individual coefficients become harder to trust.

Regularization does not eliminate the need for good data, sensible predictors, or leakage protection. It does something narrower:

it reduces coefficient volatility,
it lowers the chance that the model chases noise,
it often improves predictive performance on new data.
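The volatility claim can be checked with a small simulation. The sketch below is illustrative only and does not use the chapter's data: the sample size, the correlation of 0.95, the true coefficients, and the penalty `lambda = 5` are all assumptions chosen for the demonstration. It refits ordinary least squares and a closed-form ridge estimate on 500 simulated samples and compares how much the coefficient on the first predictor varies.

```r
# Simulated illustration: two predictors with correlation about 0.95,
# true coefficients both equal to 1. All settings are hypothetical.
set.seed(1)

fit_once <- function(n = 50, rho = 0.95, lambda = 5) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y  <- x1 + x2 + rnorm(n)

  # Center predictors and response so no intercept is needed.
  X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
  yc <- y - mean(y)

  # OLS and ridge (penalty lambda * sum(beta^2)) in closed form.
  ols   <- solve(crossprod(X), crossprod(X, yc))
  ridge <- solve(crossprod(X) + lambda * diag(2), crossprod(X, yc))

  c(ols = ols[1], ridge = ridge[1])  # coefficient on x1 only
}

draws   <- t(replicate(500, fit_once()))
coef_sd <- apply(draws, 2, sd)       # sample-to-sample spread
round(coef_sd, 3)
```

The ridge column should show a clearly smaller standard deviation. The price is bias: the ridge estimates are pulled toward zero, so their average no longer sits at the true value of 1.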

This is why regularization belongs more naturally to a predictive workflow than to a confirmatory one (see Section 158.2). A lasso model that sets some coefficients to zero can be useful, but it should not be mistaken for a substantive scientific argument about causation.

161.2 The Three Standard Penalties

For linear regression, the ordinary least squares objective is

\[ \sum_{i=1}^{n}(y_i - \hat y_i)^2. \]

Regularization adds a penalty term:

\[ \sum_{i=1}^{n}(y_i - \hat y_i)^2 + \lambda \cdot \text{Penalty}(\beta). \]

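The tuning constant \(\lambda \ge 0\) controls how strongly coefficients are shrunk: \(\lambda = 0\) recovers the unpenalized fit, and larger values shrink the coefficients more.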

Ridge regression uses \(\sum_j \beta_j^2\).
Lasso regression uses \(\sum_j |\beta_j|\).
Elastic net uses a weighted blend of the two.

Elastic net therefore needs a second tuning constant, alpha, which sets the mix between the two penalties:

alpha = 0 gives ridge,
alpha = 1 gives lasso,
values between 0 and 1 give elastic net.
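For reference, the glmnet documentation writes the Gaussian elastic-net objective in the following form. The loss is scaled by \(1/(2n)\), the ridge part carries a factor of one half, and the intercept \(\beta_0\) is not penalized, so a glmnet lambda is on a different numerical scale from the unscaled sum of squares shown above:

\[ \min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^\top\beta\bigr)^2 + \lambda\left[\frac{1-\alpha}{2}\sum_j \beta_j^2 + \alpha\sum_j |\beta_j|\right]. \]

Setting \(\alpha = 0\) leaves only the squared term (ridge), and \(\alpha = 1\) leaves only the absolute-value term (lasso).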

The key practical point is that the data do not estimate lambda for you automatically. It must be chosen by validation, which is why this chapter connects directly to Chapter 160 and Chapter 162.
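To make "chosen by validation" concrete, here is a minimal hand-rolled sketch of k-fold cross-validation for the ridge penalty, using base R only. It is not how glmnet works internally: the closed-form ridge helper `ridge_predict()`, the four folds, and the five-value lambda grid are assumptions made for illustration, and this lambda multiplies the unscaled penalty \(\sum_j \beta_j^2\) from the objective above.

```r
# Hand-rolled 4-fold cross-validation for ridge on mtcars (base R only).
data(mtcars)
X <- as.matrix(mtcars[, -1])   # all columns except mpg
y <- mtcars$mpg

# Closed-form ridge; predictors are standardized using training rows only.
ridge_predict <- function(X_tr, y_tr, X_te, lambda) {
  mu   <- colMeans(X_tr)
  s    <- apply(X_tr, 2, sd)
  Z_tr <- scale(X_tr, center = mu, scale = s)
  Z_te <- scale(X_te, center = mu, scale = s)
  beta <- solve(crossprod(Z_tr) + lambda * diag(ncol(Z_tr)),
                crossprod(Z_tr, y_tr - mean(y_tr)))
  drop(mean(y_tr) + Z_te %*% beta)
}

set.seed(2024)
fold    <- sample(rep(1:4, length.out = nrow(X)))
lambdas <- c(0.01, 0.1, 1, 10, 100)   # hypothetical grid

# Mean squared error on the held-out fold, averaged over folds.
cv_mse <- sapply(lambdas, function(lam) {
  mean(sapply(1:4, function(k) {
    test <- fold == k
    pred <- ridge_predict(X[!test, , drop = FALSE], y[!test],
                          X[test, , drop = FALSE], lam)
    mean((y[test] - pred)^2)
  }))
})

data.frame(lambda = lambdas, cv_mse = round(cv_mse, 2))
best_lambda <- lambdas[which.min(cv_mse)]
```

The point of the sketch is the shape of the procedure: each candidate lambda is scored only on rows that were not used to fit it, and the lambda with the smallest held-out error is kept.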

161.3 A Small Regression Example with `mtcars`

The mtcars dataset is small enough that ordinary fitting can be sensitive to the exact training sample. That makes it a convenient teaching example for shrinkage.

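library(glmnet)

data(mtcars)

# Predictor matrix without an intercept column; glmnet fits its own intercept.
x <- model.matrix(mpg ~ . - 1, data = mtcars)
y <- mtcars$mpg

# Root mean squared error, used for the outer-test comparison.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Hold out 8 of the 32 cars as an outer test set.
set.seed(42)
idx <- sample(seq_len(nrow(mtcars)), size = 24)

x_train <- x[idx, ]
x_test <- x[-idx, ]
y_train <- y[idx]
y_test <- y[-idx]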

This split separates three roles:

the training set is used to fit the regularized models,
cross-validation inside the training set is used to choose lambda,
the outer test set is held back for one final comparison.

That separation matters. If the same rows were used both to choose lambda and to report final performance, the comparison would be optimistic.

161.3.1 Coefficient Paths

The next figure shows what happens when lambda changes from weak shrinkage to strong shrinkage.

ridge_fit <- glmnet(x_train, y_train, alpha = 0)
lasso_fit <- glmnet(x_train, y_train, alpha = 1)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(ridge_fit, xvar = "lambda", label = TRUE)
title("Ridge coefficient paths")

plot(lasso_fit, xvar = "lambda", label = TRUE)
title("Lasso coefficient paths")

Ridge and lasso coefficient paths. Moving toward stronger penalties shrinks coefficients; lasso can drive some of them exactly to zero.

The visual difference is the main lesson:

ridge shrinks coefficients continuously,
lasso shrinks them too, but some paths hit exactly zero,
elastic net sits between those two behaviors.

So the question is not only “Which fit is most accurate?” It is also “How much simplification do we want?”
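The difference between "shrinks continuously" and "hits exactly zero" has a simple closed form in one special case. The sketch below assumes orthonormal predictors, so that the penalized objective from Section 161.2 separates into one problem per coefficient; the unpenalized estimates in `b_ols` and the value `lambda = 1` are hypothetical. Under that assumption ridge divides every coefficient by \(1 + \lambda\), while lasso subtracts \(\lambda / 2\) from each absolute value and truncates at zero (soft-thresholding).

```r
# Hypothetical unpenalized estimates for four orthonormal predictors.
b_ols  <- c(3.0, -1.2, 0.4, -0.1)
lambda <- 1

# Ridge: minimizing (b - beta)^2 + lambda * beta^2 rescales each coefficient.
ridge <- b_ols / (1 + lambda)

# Lasso: minimizing (b - beta)^2 + lambda * |beta| soft-thresholds at lambda / 2.
lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)

data.frame(ols = b_ols, ridge = ridge, lasso = lasso)
```

Every ridge coefficient stays nonzero, however small. The two smallest lasso coefficients are set to exactly zero because their unpenalized values fall below the threshold. With correlated predictors, as in mtcars, no such closed form applies, but the qualitative behavior in the coefficient paths is the same.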

161.3.2 Choosing `lambda` by Cross-Validation

cv.glmnet() searches over many lambda values and picks the one that performs best under cross-validation. It also reports a more conservative alternative: the one-standard-error rule.

lambda.min is the value with the lowest cross-validated error,
lambda.1se is the largest value whose error is still within one standard error of the minimum.

In practice, lambda.1se often gives a slightly simpler and more stable model.
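The two rules can be worked through by hand on a toy cross-validation summary. The numbers below are invented for illustration; the column names `cvm` and `cvsd` mirror the components of a `cv.glmnet` object that hold the mean cross-validated error and its standard error.

```r
# Hypothetical cross-validation summary: one row per candidate lambda.
cv <- data.frame(
  lambda = c(4, 2, 1, 0.5, 0.25),
  cvm    = c(12.0, 9.5, 9.0, 9.2, 9.6),  # mean CV error
  cvsd   = c(1.0, 0.8, 0.7, 0.7, 0.8)    # standard error of that mean
)

# lambda.min: the lambda with the smallest mean CV error.
i_min      <- which.min(cv$cvm)
lambda_min <- cv$lambda[i_min]

# lambda.1se: the largest lambda whose error is within one SE of the minimum.
threshold  <- cv$cvm[i_min] + cv$cvsd[i_min]
lambda_1se <- max(cv$lambda[cv$cvm <= threshold])

c(lambda.min = lambda_min, lambda.1se = lambda_1se)
```

Here the minimum error is 9.0 at lambda = 1, the threshold is 9.0 + 0.7 = 9.7, and the largest lambda that stays under it is 2. The one-standard-error rule therefore accepts a stronger penalty in exchange for an error that is statistically hard to distinguish from the best one.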

ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 5)
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
enet_cv <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 5)

# Summarise one cv.glmnet fit at lambda.1se.
# Relies on x_test and y_test defined in the split above.
summarise_fit <- function(name, cvfit) {
  co <- as.matrix(coef(cvfit, s = "lambda.1se"))
  nz <- sum(abs(co[-1, 1]) > 0)  # nonzero coefficients, intercept excluded
  pred <- as.numeric(predict(cvfit, newx = x_test, s = "lambda.1se"))
  data.frame(
    Model = name,
    Lambda1SE = cvfit$lambda.1se,
    TestRMSE = rmse(y_test, pred),
    ActiveCoefficients = nz
  )
}

reg_compare <- rbind(
  summarise_fit("Ridge", ridge_cv),
  summarise_fit("Lasso", lasso_cv),
  summarise_fit("Elastic net", enet_cv)
)

knitr::kable(
  transform(
    reg_compare,
    Lambda1SE = round(Lambda1SE, 3),
    TestRMSE = round(TestRMSE, 3)
  ),
  caption = "Outer-test comparison after choosing lambda by inner cross-validation"
)

Outer-test comparison after choosing lambda by inner cross-validation
Model	Lambda1SE	TestRMSE	ActiveCoefficients
Ridge	10.392	4.582	10
Lasso	1.439	4.860	3
Elastic net	2.177	4.805	7

In this run:

ridge gives the lowest outer-test RMSE,
lasso gives the sparsest model,
elastic net sits between them on both complexity and error.

That is the regularization tradeoff in one table: accuracy, simplicity, and stability do not always point to the same choice.

161.3.3 What Shrinkage Looks Like Numerically

ridge_coef <- round(as.matrix(coef(ridge_cv, s = "lambda.1se")), 3)
lasso_coef <- round(as.matrix(coef(lasso_cv, s = "lambda.1se")), 3)

coef_compare <- data.frame(
  Predictor = rownames(ridge_coef),
  Ridge = ridge_coef[, 1],
  Lasso = lasso_coef[, 1]
)

knitr::kable(
  coef_compare,
  row.names = FALSE,
  caption = "Ridge versus lasso coefficients at lambda.1se"
)

Ridge versus lasso coefficients at lambda.1se
Predictor	Ridge	Lasso
(Intercept)	18.680	30.261
cyl	-0.272	-0.465
disp	-0.004	0.000
hp	-0.008	-0.011
drat	0.843	0.000
wt	-0.691	-1.945
qsec	0.138	0.000
vs	0.574	0.000
am	0.964	0.000
gear	0.481	0.000
carb	-0.346	0.000

The ridge model keeps every predictor in the model, but many are small. The lasso model keeps only a few predictors away from zero. That does not mean the dropped variables are scientifically irrelevant. It means the lasso does not need them to optimize this predictive fit at this penalty level.

161.4 The Same Logic in Logistic Regression

Regularization is not limited to linear regression. The same glmnet framework works for binomial outcomes.

library(MASS)
data("Pima.tr", package = "MASS")

x_pima <- model.matrix(type ~ . - 1, data = Pima.tr)
y_pima <- Pima.tr$type

set.seed(123)
lasso_logit <- cv.glmnet(
  x_pima, y_pima,
  family = "binomial",
  alpha = 1,
  nfolds = 5
)

lasso_logit_coef <- round(as.matrix(coef(lasso_logit, s = "lambda.1se")), 3)
lasso_logit_coef <- data.frame(
  Predictor = rownames(lasso_logit_coef),
  Coefficient = lasso_logit_coef[, 1]
)

knitr::kable(
  subset(lasso_logit_coef, Coefficient != 0),
  row.names = FALSE,
  caption = "Nonzero coefficients from a lasso-logistic fit on Pima.tr"
)

Nonzero coefficients from a lasso-logistic fit on Pima.tr
Predictor	Coefficient
(Intercept)	-5.498
npreg	0.024
glu	0.021
bmi	0.030
ped	0.509
age	0.025

Here the lasso keeps only a subset of the predictors. The practical reading rule is the same as before:

this is useful for prediction,
it can simplify the model,
but it should not be interpreted as a proof that the dropped variables “do not matter” in the underlying phenomenon.
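Because the fit is meant for prediction, a natural follow-up is to score it on rows it has not seen. The sketch below is one possible way to do that, not part of the chapter's own analysis: it assumes the companion data set `Pima.te` from MASS as the held-out rows and a 0.5 probability cutoff, both of which are choices made here for illustration.

```r
library(glmnet)
library(MASS)

# Same predictor coding for training and held-out rows.
x_tr <- model.matrix(type ~ . - 1, data = Pima.tr)
x_te <- model.matrix(type ~ . - 1, data = Pima.te)

set.seed(123)
fit <- cv.glmnet(x_tr, Pima.tr$type, family = "binomial", alpha = 1, nfolds = 5)

# Predicted probability of the second factor level ("Yes").
p_yes <- as.numeric(
  predict(fit, newx = x_te, s = "lambda.1se", type = "response")
)

# Classify at a 0.5 cutoff (an assumption; other cutoffs suit other costs).
pred <- factor(ifelse(p_yes > 0.5, "Yes", "No"), levels = levels(Pima.te$type))

table(Predicted = pred, Observed = Pima.te$type)
mean(pred == Pima.te$type)  # held-out accuracy
```

The confusion table is usually more informative than the single accuracy figure, because the two kinds of error rarely carry the same cost in a screening problem.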

161.5 Try Regularization in the Apps

The apps now expose regularization in two different styles:

the app in the menu Models / Manual Model Building lets you turn regularization on explicitly in the GLM and Regression tabs,
the Guided Model Building app includes regularized logistic and regularized linear regression as candidate models and reports the selected tuning results in the fitted model details.

161.5.1 Manual Model Building: Explicit Search on the GLM Tab

In the manual app, choose GLM, switch Regularization / hyperparameter search away from None, and then click Fit regularized GLM. The app runs the fit asynchronously and shows a wait/progress message if another heavy search is already running.

Interactive Shiny app (click to load).

This view is useful because it makes the tuning decision visible. You can switch between ridge, lasso, elastic net, or an automatic penalty-family search and then inspect which coefficients remain active. The same regularization control also appears in the Regression tab for continuous outcomes.

161.5.2 Guided Model Building: Regularized Candidates Inside a Workflow

The guided app treats regularization differently. Instead of asking the learner to launch a large free-form search, it includes regularized coefficient models directly in the candidate set and tunes the penalty family inside the training rows only.

Full-screen use

The Guided Model Building app is still much easier to read in a new tab, but the embedded panel below loads the regularized Pima session directly if you want to inspect the workflow from inside the chapter first.

Interactive Guided Model Building session (click to load).

In that session, the most important places to look are:

Models, where the regularized logistic fit lists the selected penalty family, alpha, lambda.min, lambda.1se, and the nonzero coefficients,
Diagnostics, where you can compare its repeated-validation behavior with the other classifiers,
Export, where the R script reconstructs the glmnet path transparently.

The manual and guided apps therefore teach two complementary regularization habits:

manual workflow: choose and launch the search yourself,
guided workflow: compare a tuned shrinkage path against other candidate models inside a controlled validation workflow.

161.6 Practical Reading Rule

Use regularization when the goal is prediction and the coefficient pattern looks too unstable to trust unpenalized fitting.

Prefer ridge when you want shrinkage but do not want to drop predictors.
Prefer lasso when you also want automatic sparsity.
Prefer elastic net when predictors are correlated and pure lasso feels too aggressive.
Prefer the one-standard-error rule when a slightly simpler model performs essentially as well as the apparent optimum.
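When the choice between the three penalties is itself in question, one common approach is to compare several alpha values on identical folds, so that differences in cross-validated error reflect the penalty and not the random split. The sketch below assumes a three-value alpha grid and uses the `foldid` argument of `cv.glmnet()` to fix fold membership. It runs on all 32 rows for brevity; for an honest final error estimate the same comparison would be run inside the training rows only.

```r
library(glmnet)

x <- model.matrix(mpg ~ . - 1, data = mtcars)
y <- mtcars$mpg

# Fix fold membership once so every alpha is judged on the same splits.
set.seed(42)
foldid <- sample(rep(1:5, length.out = nrow(x)))

alphas <- c(0, 0.5, 1)  # ridge, elastic net, lasso (hypothetical grid)

alpha_compare <- do.call(rbind, lapply(alphas, function(a) {
  cvfit <- cv.glmnet(x, y, alpha = a, foldid = foldid)
  i_1se <- which.min(abs(cvfit$lambda - cvfit$lambda.1se))
  data.frame(
    alpha      = a,
    lambda.1se = cvfit$lambda.1se,
    cv_mse     = cvfit$cvm[i_1se]  # mean CV error at lambda.1se
  )
}))

alpha_compare
```

On a data set this small the differences between rows are often within one standard error of each other, in which case the reading rules above (keep all predictors, want sparsity, correlated predictors) are a more defensible basis for the choice than the third decimal of the error.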

161.7 Practical Exercises

Refit the mtcars example with a different random split. Does the identity of the lowest-RMSE method stay the same?
Compare lambda.min and lambda.1se for the lasso fit. How many active coefficients do you gain or lose?
Change the elastic-net mixing parameter from 0.5 to 0.2 and then to 0.8. How does the coefficient pattern move toward ridge or lasso?
In the Pima.tr logistic example, compare the nonzero coefficient set at lambda.min and lambda.1se. Which rule gives the simpler classifier?
