165 Leakage, Target Encoding, and Robust Regression
When you build a model, some mistakes make the result look stronger than it really is. Others hide instability until the diagnostics stage. In this chapter you will study the protections that the Guided Model Building app applies inside the workflow and see exactly how they change model results:
- leakage protection for target-adjacent predictors,
- prediction-time availability metadata,
- grouped splitting for repeated entities,
- fold-safe target encoding,
- Huber robust regression.
These topics are introduced here for the first time. Unlike the methods in earlier chapters (logistic regression, conditional inference trees, ARIMA), they have not been covered in standalone chapters because they are best understood inside the guided workflow where their practical consequences are immediately visible.
The practical question is always the same:
What should the workflow do when a seemingly useful modeling move is methodologically unsafe or too fragile?
165.1 Open the App Full Screen
Warning: Full-screen use
These examples are best studied in a full browser tab. The app needs the available width for the ranked predictor guide, diagnostics, and revision comparison tables.
The handbook sessions are useful here for different reasons:
- Cars93 guardrails opens directly on a leakage problem: Price is the target, but Min.Price and Max.Price have already been attempted as predictors.
- Cars93 robust-regression opens a fitted explanatory session in which target skewness (see Section 67.5) and outliers (see Chapter 69) make Huber regression worth comparing.
165.2 Leakage Protection with Cars93
Cars93 is a cross-sectional dataset on passenger cars sold in the early 1990s. In this section the modeling goal is predictive: the task is to predict Price without sneaking target-derived information into the predictor set.
Cars93 contains three closely related price variables:
- Min.Price
- Price
- Max.Price
If the target is Price, then Min.Price and Max.Price are tempting predictors. They are also dangerous predictors. They are not ordinary exogenous variables. They are derived from the same pricing information as the target itself.
Code
```r
library(MASS)

price_triplet <- subset(Cars93, select = c(Min.Price, Price, Max.Price))

knitr::kable(
  round(cor(price_triplet, use = "pairwise.complete.obs"), 3),
  caption = "Correlation structure among the three price variables in Cars93"
)
```
Correlation structure among the three price variables in Cars93
|           | Min.Price | Price | Max.Price |
|-----------|----------:|------:|----------:|
| Min.Price |     1.000 | 0.971 |     0.907 |
| Price     |     0.971 | 1.000 |     0.982 |
| Max.Price |     0.907 | 0.982 |     1.000 |
This is exactly the kind of classroom situation in which a model can appear excellent for the wrong reason. The app therefore treats it as a leakage problem rather than as an innocent correlation problem.
The logic is intentionally conservative:
- the predictor name must look closely related to the target name,
- the predictor must also have near-deterministic association with the target.
The check catches the most common classroom mistake: accidentally including a variable that is really just a reformulation of the target.
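As a rough illustration of that two-part logic, the check could be sketched as follows. This is a simplified sketch, not the app's exact implementation: the pattern test, the 0.95 cutoff, and both function names are assumptions.

```r
library(MASS)

# Part 1 (assumption): does the predictor's name contain the target's name?
looks_target_adjacent <- function(predictor_name, target_name) {
  grepl(target_name, predictor_name, ignore.case = TRUE)
}

# Part 2 (assumption): is the rank association with the target near-deterministic?
near_deterministic <- function(x, y, threshold = 0.95) {
  abs(cor(x, y, method = "spearman", use = "pairwise.complete.obs")) >= threshold
}

looks_target_adjacent("Max.Price", "Price")         # TRUE: name contains the target name
near_deterministic(Cars93$Max.Price, Cars93$Price)  # TRUE: Spearman rho = 0.984
```

Only a predictor that fails both checks at once is treated as leakage, which is why an honestly informative variable like Horsepower passes even though it correlates with Price.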
The next table uses the same helper that the app uses internally to decide which predictors should be hidden or blocked.
Code
```r
leakage_screen <- gmb_env$target_adjacent_leakage_predictors(
  data       = Cars93,
  target     = "Price",
  predictors = c("Min.Price", "Max.Price", "Horsepower", "EngineSize"),
  ruleset    = gmb_ruleset
)

knitr::kable(
  leakage_screen,
  digits  = 3,
  caption = "Target-adjacent predictors that the app screens out or blocks for Price modeling"
)
```
Target-adjacent predictors that the app screens out or blocks for Price modeling
| Predictor | Method   | Association | Action |
|-----------|----------|------------:|--------|
| Max.Price | Spearman |       0.984 | block  |
| Min.Price | Spearman |       0.974 | block  |
165.2.1 What the Handbook Session Lets You Observe
If you open the Cars93 guardrails session, the app opens on the Audit step with the suspect predictors already attempted. That means you can see the leakage rule fire before any model is fitted.
Audit blocks raised when target-adjacent price variables are forced into the Cars93 model
| code | title | message |
|------|-------|---------|
| target_adjacent_leakage | Likely target-derived leakage detected | Predictor(s) Max.Price (0.984), Min.Price (0.974) appear derived from the target name ‘Price’ and have near-deterministic association with it. Remove them before fitting to avoid leakage. |
This is deliberate. The workflow should not reward the learner for using information that is too close to the outcome itself.
The leakage rule is a configurable ruleset item with a strict or warn mode, so the reasoning behind the block is visible rather than hidden inside the code.
165.3 Prediction-Time Availability Is Not a Yes/No Question
Some variables are problematic even when they are not literal copies of the target. A field can be measured before the outcome and still be operationally unusable because it is published with a delay, only exists in a retrospective archive, or is effectively known only after the event of interest.
The app therefore lets the user classify predictors into availability buckets:
- Available only with delay
- Retrospective only / not deployable
- Known only after the outcome
- Derived from the outcome / target-adjacent
The same declared status is interpreted differently by goal. The table below shows the current logic.
How the app interprets predictor availability declarations by modeling goal
| status | goal | action |
|--------|------|--------|
| Available only with delay | Prediction | warn |
| Retrospective only / not deployable | Prediction | block |
| Known only after the outcome | Prediction | block |
| Derived from the outcome / target-adjacent | Prediction | block |
| Available only with delay | Explanation / Confirmation | info |
| Retrospective only / not deployable | Explanation / Confirmation | info |
| Known only after the outcome | Explanation / Confirmation | block |
| Derived from the outcome / target-adjacent | Explanation / Confirmation | block |
For prediction, delayed variables create a warning and retrospective or post-outcome variables create a block. For explanation, the app remains strict about post-outcome and target-adjacent variables, but it allows the user to keep retrospective variables as long as their scientific role is explicit.
This is a useful teaching distinction. A variable can be scientifically interesting in a retrospective explanatory study and still be unusable in a real deployment.
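The decision table above can be condensed into a small lookup. This is an illustrative sketch of the logic, not the app's internal API; the function name and string matching are assumptions.

```r
# Sketch of the availability decision logic from the table above.
# Post-outcome and target-derived variables are blocked for every goal;
# the remaining cases depend on the modeling goal.
availability_action <- function(status, goal) {
  always_block <- c("Known only after the outcome",
                    "Derived from the outcome / target-adjacent")
  if (status %in% always_block) return("block")
  if (goal == "Prediction") {
    if (status == "Available only with delay") "warn" else "block"
  } else {
    "info"  # Explanation / Confirmation tolerates delayed and retrospective data
  }
}

availability_action("Retrospective only / not deployable", "Prediction")                 # "block"
availability_action("Retrospective only / not deployable", "Explanation / Confirmation") # "info"
```

The asymmetry in the last two calls is the teaching point: the same declared status leads to opposite actions once the goal changes.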
165.4 Grouped Data Need Grouped Splits
Another leakage problem appears when rows are not independent. If the same patient, student, household, or firm contributes multiple rows, a naive random split can place one row in training and another row from the same entity in testing. The result is an overly optimistic validation score because the model has effectively seen part of the test unit already.
The app therefore includes an optional Group / entity variable in the Data step. Once it is set, repeated holdout keeps all rows from the same group in the same split.
In the toy data below, the grouping variable is patient. Each patient appears twice, so the rows are not independent observations from unrelated people. If one row from P3 were used for training and the second row from P3 were used for testing, the test result would no longer be a clean check on a genuinely unseen patient.
The following toy example uses the same split helper as the app.
Code
```r
group_demo <- data.frame(
  patient = rep(paste0("P", 1:8), each = 2),
  outcome = factor(rep(c("neg", "pos"), each = 8)),
  marker  = c(2.1, 2.3, 2.8, 3.0, 1.9, 2.0, 3.2, 3.3,
              2.4, 2.5, 3.4, 3.6, 1.7, 1.8, 3.1, 3.0)
)

knitr::kable(
  data.frame(row_id = seq_len(nrow(group_demo)), group_demo),
  caption = "Toy classification data with patient as the grouping variable"
)
```
Toy classification data with patient as the grouping variable
| row_id | patient | outcome | marker |
|-------:|---------|---------|-------:|
|      1 | P1      | neg     |    2.1 |
|      2 | P1      | neg     |    2.3 |
|      3 | P2      | neg     |    2.8 |
|      4 | P2      | neg     |    3.0 |
|      5 | P3      | neg     |    1.9 |
|      6 | P3      | neg     |    2.0 |
|      7 | P4      | neg     |    3.2 |
|      8 | P4      | neg     |    3.3 |
|      9 | P5      | pos     |    2.4 |
|     10 | P5      | pos     |    2.5 |
|     11 | P6      | pos     |    3.4 |
|     12 | P6      | pos     |    3.6 |
|     13 | P7      | pos     |    1.7 |
|     14 | P7      | pos     |    1.8 |
|     15 | P8      | pos     |    3.1 |
|     16 | P8      | pos     |    3.0 |
Code
```r
group_split <- gmb_env$split_grouped_classification_indices(
  data           = group_demo,
  target         = "outcome",
  group_variable = "patient",
  train_fraction = 0.75,
  seed           = 123
)

group_row_assignment <- data.frame(
  row_id  = seq_len(nrow(group_demo)),
  patient = group_demo$patient,
  outcome = group_demo$outcome,
  marker  = group_demo$marker,
  split   = ifelse(seq_len(nrow(group_demo)) %in% group_split$train_idx,
                   "train", "test")
)

knitr::kable(
  group_row_assignment,
  caption = "Grouped holdout assignment: once patient is declared as the grouping variable, both rows for each patient stay together"
)
```
Grouped holdout assignment: once patient is declared as the grouping variable, both rows for each patient stay together
| row_id | patient | outcome | marker | split |
|-------:|---------|---------|-------:|-------|
|      1 | P1      | neg     |    2.1 | test  |
|      2 | P1      | neg     |    2.3 | test  |
|      3 | P2      | neg     |    2.8 | train |
|      4 | P2      | neg     |    3.0 | train |
|      5 | P3      | neg     |    1.9 | train |
|      6 | P3      | neg     |    2.0 | train |
|      7 | P4      | neg     |    3.2 | train |
|      8 | P4      | neg     |    3.3 | train |
|      9 | P5      | pos     |    2.4 | test  |
|     10 | P5      | pos     |    2.5 | test  |
|     11 | P6      | pos     |    3.4 | train |
|     12 | P6      | pos     |    3.6 | train |
|     13 | P7      | pos     |    1.7 | train |
|     14 | P7      | pos     |    1.8 | train |
|     15 | P8      | pos     |    3.1 | train |
|     16 | P8      | pos     |    3.0 | train |
The practical lesson is simple: if the unit of prediction is a patient, then the split must respect patient boundaries too.
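The core idea behind a grouped split can be sketched in a few lines of base R, reusing group_demo from the chunk above: sample whole groups rather than rows, then assign every row to the side its group landed on. This is a simplified sketch; the app's split_grouped_classification_indices may additionally balance the outcome classes across splits.

```r
# Minimal grouped holdout sketch: whole patients, not individual rows,
# are assigned to train or test.
set.seed(123)
groups       <- unique(group_demo$patient)                       # 8 patients
train_groups <- sample(groups, size = round(0.75 * length(groups)))  # 6 of 8
split_label  <- ifelse(group_demo$patient %in% train_groups, "train", "test")

# Every patient's rows land on one side only:
table(group_demo$patient, split_label)
```

The cross-tabulation at the end is the quick sanity check: each patient's row count should appear entirely in one column, never split across "train" and "test".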
165.5 Fold-Safe Target Encoding
Target encoding replaces each level of a categorical predictor with the mean target value for that level. It is useful when a categorical predictor contains many levels and a purely dummy-based treatment becomes awkward or unstable. But it is also a classic place where leakage can be introduced accidentally.
Before looking at the leakage issue, it helps to compare target encoding with the more familiar dummy approach.
165.5.1 Dummy Variables Versus Target Encoding
Suppose a categorical predictor Manufacturer has three levels: Acura, Cadillac, and Geo.
With dummy coding (also called one-hot encoding), the model receives one indicator column per level, or one fewer if a reference category is omitted.
With target encoding, the model receives one numeric column whose value is the mean target for that category.
The following toy example shows the difference row by row.
Code
```r
encoding_toy <- data.frame(
  Manufacturer = factor(c("Acura", "Cadillac", "Geo", "Acura", "Geo", "Cadillac")),
  Price        = c(15, 32, 10, 16, 11, 30)
)

toy_dummies <- data.frame(model.matrix(~ Manufacturer - 1, data = encoding_toy))

toy_target_map     <- tapply(encoding_toy$Price, encoding_toy$Manufacturer, mean)
toy_target_encoded <- unname(toy_target_map[as.character(encoding_toy$Manufacturer)])

toy_compare <- cbind(
  encoding_toy,
  toy_dummies,
  TargetEncodedManufacturer = toy_target_encoded
)

knitr::kable(
  toy_compare,
  digits  = 2,
  caption = "One-hot encoding and target encoding for the same categorical predictor"
)
```
One-hot encoding and target encoding for the same categorical predictor
| Manufacturer | Price | ManufacturerAcura | ManufacturerCadillac | ManufacturerGeo | TargetEncodedManufacturer |
|--------------|------:|------------------:|---------------------:|----------------:|--------------------------:|
| Acura        |    15 |                 1 |                    0 |               0 |                      15.5 |
| Cadillac     |    32 |                 0 |                    1 |               0 |                      31.0 |
| Geo          |    10 |                 0 |                    0 |               1 |                      10.5 |
| Acura        |    16 |                 1 |                    0 |               0 |                      15.5 |
| Geo          |    11 |                 0 |                    0 |               1 |                      10.5 |
| Cadillac     |    30 |                 0 |                    1 |               0 |                      31.0 |
Dummy coding is usually easier to explain because every level keeps its own column. Target encoding becomes attractive when the categorical predictor has many levels and a dummy-based treatment would create a wide and unstable design matrix. The cost is that the target itself now participates in the encoding, so leakage control becomes critical.
The unsafe version is easy to describe:
1. compute the mean target value for each category on the full dataset,
2. replace the category by that mean,
3. validate the model.
That procedure leaks target information from the validation rows back into the encoded predictor.
The app therefore uses training-only target encoding inside each validation split.
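A minimal sketch of the training-only version, reusing encoding_toy from above, might look like this. The function name is an assumption for illustration; the app's helper handles the fold bookkeeping internally.

```r
# Training-only target encoding: level means come from the training rows,
# are applied to the test rows, and unseen levels fall back to the
# training-set global mean.
encode_train_only <- function(train, test, cat_col, target_col) {
  level_means <- tapply(train[[target_col]], train[[cat_col]], mean)
  global_mean <- mean(train[[target_col]])
  encoded     <- level_means[as.character(test[[cat_col]])]
  encoded[is.na(encoded)] <- global_mean   # unseen-category fallback
  unname(encoded)
}

train_rows <- encoding_toy[c(1, 3, 4, 5), ]  # Acura and Geo only
test_rows  <- encoding_toy[c(2, 6), ]        # Cadillac: unseen in training
encode_train_only(train_rows, test_rows, "Manufacturer", "Price")
# Both Cadillac rows fall back to the training global mean, (15 + 10 + 16 + 11) / 4 = 13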
165.5.2 A Concrete Encoding Illustration with Cars93 Manufacturers
Cars93 also contains a manufacturer variable that can be encoded as a target-aware summary for predictive price modeling. The crucial point is that the summary must be computed from the training rows only.
In other words:
- dummy coding asks only, “Which manufacturer is this row?”
- target encoding asks, “What was the average target value for this manufacturer in the training rows?”
That second question is exactly why leakage can occur if the validation rows are allowed to influence the encoding.
Full-data target encoding versus training-only target encoding on the test rows
|    | Manufacturer | FullDataEncoding | TrainOnlyEncoding | Difference |
|----|--------------|-----------------:|------------------:|-----------:|
| 3  | Cadillac     |           37.400 |            19.500 |     17.900 |
| 4  | Cadillac     |           37.400 |            19.500 |     17.900 |
| 1  | Acura        |           24.900 |            19.500 |      5.400 |
| 2  | Acura        |           24.900 |            19.500 |      5.400 |
| 17 | Subaru       |           12.933 |             9.650 |      3.283 |
| 7  | Geo          |           10.450 |             8.400 |      2.050 |
| 18 | Toyota       |           17.275 |            15.467 |      1.808 |
| 19 | Volkswagen   |           18.025 |            16.267 |      1.758 |
| 14 | Nissan       |           17.025 |            18.767 |     -1.742 |
| 12 | Mazda        |           17.600 |            19.100 |     -1.500 |
The column FullDataEncoding is the unsafe version. The column TrainOnlyEncoding is the fold-safe version that the app uses during validation.
The difference is not cosmetic. If the encoding is built on the full dataset, the model has already been allowed to see the average target behavior of the rows it is supposed to predict later.
If a category appears in the test rows but not in the training rows, the app does not borrow information from the validation data. It falls back to the training-set global mean instead.
165.6 Huber Robust Regression
The app includes Huber robust regression as a standard regression option. This matters when ordinary least squares (as used in Chapter 135) is being driven too strongly by a small number of extreme target values.
Ordinary least squares minimizes the sum of squared residuals, which means a single extreme observation can pull the fitted line substantially. Huber regression uses a modified loss function: residuals within a threshold are squared as usual, but residuals beyond that threshold are penalized linearly instead of quadratically. The result is that extreme observations receive less influence over the fit.
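The switch from quadratic to linear growth is the whole mechanism, so it is worth seeing numerically. The sketch below implements the standard Huber loss; k = 1.345 is the conventional tuning constant (it is also MASS::rlm's default for psi.huber), not a value specific to the app.

```r
# Huber loss: quadratic within the threshold k, linear beyond it.
huber_loss <- function(r, k = 1.345) {
  ifelse(abs(r) <= k, 0.5 * r^2, k * abs(r) - 0.5 * k^2)
}

# Squared loss keeps accelerating with the residual; Huber loss grows
# only linearly once |r| passes k:
r <- c(0.5, 1, 2, 4, 8)
round(rbind(squared = 0.5 * r^2, huber = huber_loss(r)), 3)
```

For the residual of 8, the squared loss is 32 while the Huber loss stays below 10, which is exactly why a single extreme observation loses its grip on the fitted line.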
This does not replace the descriptive discussion of winsorization in the central tendency material. It addresses a different question:
If the target contains influential extremes, should the workflow compare a model whose fitting criterion is itself more robust?
165.6.1 Coefficient Comparison in the Cars93 Example
The Cars93 robust-regression session uses a compact explanatory model with:
- Horsepower
- Fuel.tank.capacity
- EngineSize
- AirBags
The following code compares ordinary least squares and Huber regression on that same formula.
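A hedged sketch of that comparison using MASS::rlm is shown below; the formula matches the session's predictor list, but the app's exact fitting call may differ. rlm uses the Huber psi by default.

```r
library(MASS)

# Sketch: same compact explanatory formula as the Cars93 robust-regression
# session (predictors taken from the list above).
cars_formula <- Price ~ Horsepower + Fuel.tank.capacity + EngineSize + AirBags

ols_fit   <- lm(cars_formula, data = Cars93)
huber_fit <- rlm(cars_formula, data = Cars93, maxit = 50)

data.frame(
  Term  = names(coef(ols_fit)),
  OLS   = round(unname(coef(ols_fit)), 4),
  Huber = round(unname(coef(huber_fit)), 4)
)
```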
Ordinary least squares and Huber coefficient estimates for the Cars93 example
| Term               |     OLS |   Huber |
|--------------------|--------:|--------:|
| (Intercept)        |  1.2838 | -1.1604 |
| Horsepower         |  0.1103 |  0.1103 |
| Fuel.tank.capacity |  0.4921 |  0.5377 |
| EngineSize         | -0.6504 | -0.9022 |
| AirBagsDriver only | -3.3891 | -1.7059 |
| AirBagsNone        | -6.9470 | -4.5975 |
The app also reports how much of the sample received reduced weight under the Huber fit.
Code
```r
huber_weights <- gmb_env$huber_weight_summary(huber_fit)

knitr::kable(
  data.frame(
    quantity = c("Observations", "Downweighted", "Share downweighted",
                 "Minimum weight", "Median weight"),
    value = c(
      huber_weights$n,
      huber_weights$downweighted,
      round(huber_weights$share, 3),
      round(huber_weights$min_weight, 3),
      round(huber_weights$median_weight, 3)
    )
  ),
  caption = "How strongly the Huber fit downweights observations in the Cars93 example"
)
```
How strongly the Huber fit downweights observations in the Cars93 example
| quantity           |  value |
|--------------------|-------:|
| Observations       | 93.000 |
| Downweighted       | 22.000 |
| Share downweighted |  0.237 |
| Minimum weight     |  0.128 |
| Median weight      |  1.000 |
Huber regression does not always replace ordinary least squares. The point is that when the audit has already signaled target outliers, a robust fitting criterion deserves a place in the comparison. The weights give you a direct way to see that this is not abstract theory: some observations really do count less in the robust fit.
That is why the app suggests Huber regression as a candidate when the target-outlier warning is active.
165.7 How These Guardrails Fit the Workflow
These protections belong together because they protect different parts of the workflow:
| Guardrail | Main risk | App response |
|-----------|-----------|--------------|
| Leakage protection | target-adjacent predictors make the model look stronger than it really is | hide or block suspect predictors before fitting |
| Availability metadata | a variable is scientifically interesting but not really available at the decision cutoff | warn or block based on goal and declared status |
| Grouped splitting | train/test rows from the same entity leak information across the split | keep all rows from the same group in the same split |
| Fold-safe target encoding | preprocessing itself leaks target information into validation | compute encodings inside each training split |
| Huber regression | a few extreme target values dominate ordinary least squares | compare a robust regression criterion instead of only changing the data |
This chapter therefore extends Chapter 163 and Chapter 164. The earlier chapters show how the workflow is used. This chapter shows how the workflow protects itself against some of its own most dangerous failure modes.
165.8 Practical Exercises
1. Open the Cars93 guardrails session and record the exact audit block that fires before any model is fitted.
2. Pick one predictor from a dataset you know well and decide whether it is immediately available, delayed, retrospective-only, or post-outcome. Explain why.
3. Reproduce the target-encoding comparison above and explain why the full-data encoding is not acceptable for validation.
4. Compare the lm and Huber coefficients for the Cars93 formula. Which coefficients move the most?
5. In your own words, explain the difference between a preprocessing guardrail, a split-design guardrail, and a model-fitting guardrail.