165  Leakage, Target Encoding, and Robust Regression

When you build a model, some mistakes make the result look stronger than it really is. Others hide instability until the diagnostics stage. In this chapter you will study the protections that the Guided Model Building app applies inside the workflow and see exactly how they change model results:

  • leakage protection for target-adjacent predictors,
  • prediction-time availability metadata,
  • grouped splitting for repeated entities,
  • fold-safe target encoding,
  • Huber robust regression.

These topics appear here for the first time. Unlike logistic regression, conditional inference trees, or ARIMA, they do not get standalone chapters: they are best understood inside the guided workflow, where their practical consequences are immediately visible.

The practical question is always the same:

What should the workflow do when a seemingly useful modeling move is methodologically unsafe or too fragile?

165.1 Open the App Full Screen

Warning: Full-screen use

These examples are best studied in a full browser tab. The app needs the available width for the ranked predictor guide, diagnostics, and revision comparison tables.

  • Open the blank Guided Model Building app
  • Open the Cars93 guardrails session
  • Open the Cars93 robust-regression session

The handbook sessions are useful here for different reasons:

  • Cars93 guardrails opens directly on a leakage problem: Price is the target, but Min.Price and Max.Price have already been attempted as predictors.
  • Cars93 robust-regression opens a fitted explanatory session in which target skewness (see Section 67.5) and outliers (see Chapter 69) make Huber regression worth comparing.

165.2 Leakage Protection with Cars93

Cars93 is a cross-sectional dataset on passenger cars sold in the early 1990s. In this section the modeling goal is predictive: the task is to predict Price without sneaking target-derived information into the predictor set.

Cars93 contains three closely related price variables:

  • Min.Price
  • Price
  • Max.Price

If the target is Price, then Min.Price and Max.Price are tempting predictors, and dangerous ones: they are not ordinary exogenous variables but are derived from the same pricing information as the target itself.

Code
library(MASS)

price_triplet <- subset(Cars93, select = c(Min.Price, Price, Max.Price))

knitr::kable(
  round(cor(price_triplet, use = "pairwise.complete.obs"), 3),
  caption = "Correlation structure among the three price variables in Cars93"
)
Correlation structure among the three price variables in Cars93

            Min.Price  Price  Max.Price
Min.Price       1.000  0.971      0.907
Price           0.971  1.000      0.982
Max.Price       0.907  0.982      1.000

This is exactly the kind of classroom situation in which a model can appear excellent for the wrong reason. The app therefore treats it as a leakage problem rather than as an innocent correlation problem.

The logic is intentionally conservative:

  1. the predictor name must look closely related to the target name,
  2. the predictor must also have near-deterministic association with the target.

The check catches the most common classroom mistake: accidentally including a variable that is really just a reformulation of the target.
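For intuition, here is a minimal stand-alone sketch of that two-part rule, using a crude name-match heuristic and a Spearman threshold of 0.95. It is illustrative only; the app's actual implementation, used in the next table, is an internal helper that handles more edge cases.

Code
library(MASS)  # Cars93

# Illustrative sketch, NOT the app's internal code: flag a predictor only if
# (1) its name looks related to the target name AND
# (2) its rank association with the target is near-deterministic.
screen_target_adjacent <- function(data, target, predictors, threshold = 0.95) {
  name_suspect <- grepl(target, predictors, ignore.case = TRUE)
  assoc <- vapply(predictors, function(p) {
    abs(cor(data[[p]], data[[target]],
            method = "spearman", use = "pairwise.complete.obs"))
  }, numeric(1))
  data.frame(
    Predictor   = predictors,
    NameSuspect = name_suspect,
    Spearman    = round(assoc, 3),
    Action      = ifelse(name_suspect & assoc >= threshold, "block", "keep")
  )
}

screen_target_adjacent(
  Cars93, "Price",
  c("Min.Price", "Max.Price", "Horsepower", "EngineSize")
)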

The next table uses the same helper that the app uses internally to decide which predictors should be hidden or blocked.

Code
leakage_screen <- gmb_env$target_adjacent_leakage_predictors(
  data = Cars93,
  target = "Price",
  predictors = c("Min.Price", "Max.Price", "Horsepower", "EngineSize"),
  ruleset = gmb_ruleset
)

knitr::kable(
  leakage_screen,
  digits = 3,
  caption = "Target-adjacent predictors that the app screens out or blocks for Price modeling"
)
Target-adjacent predictors that the app screens out or blocks for Price modeling

Predictor  Method    Association  Action
Max.Price  Spearman        0.984  block
Min.Price  Spearman        0.974  block

165.2.1 What the Handbook Session Lets You Observe

The Cars93 guardrails session starts on the Audit step with the suspect predictors already attempted, so you can watch the leakage rule fire before any model is fitted.

The same block can be reproduced directly in R:

Code
leakage_audit <- gmb_env$compute_audit(
  data = Cars93,
  target = "Price",
  predictors = c("Horsepower", "Min.Price", "Max.Price"),
  row_order_meaningful = FALSE,
  time_index_variable = NULL,
  seasonal_period = 1,
  ruleset = gmb_ruleset
)

leakage_blocks <- do.call(
  rbind,
  lapply(leakage_audit$blocks, function(x) {
    data.frame(
      code = x$code,
      title = x$title,
      message = x$message,
      stringsAsFactors = FALSE
    )
  })
)

knitr::kable(
  leakage_blocks,
  caption = "Audit blocks raised when target-adjacent price variables are forced into the Cars93 model"
)
Audit blocks raised when target-adjacent price variables are forced into the Cars93 model

code:    target_adjacent_leakage
title:   Likely target-derived leakage detected
message: Predictor(s) Max.Price (0.984), Min.Price (0.974) appear derived from the target name ‘Price’ and have near-deterministic association with it. Remove them before fitting to avoid leakage.

This is deliberate. The workflow should not reward the learner for using information that is too close to the outcome itself.

The leakage rule is a configurable ruleset item with a strict or warn mode, so the reasoning behind the block is visible rather than hidden inside the code.

165.3 Prediction-Time Availability Is Not a Yes/No Question

Some variables are problematic even when they are not literal copies of the target. A field can be measured before the outcome and still be operationally unusable: it may be published with a delay, exist only in a retrospective archive, or become accessible only after the event of interest.

The app therefore lets the user classify predictors into availability buckets:

  • Available only with delay
  • Retrospective only / not deployable
  • Known only after the outcome
  • Derived from the outcome / target-adjacent

The same declared status is interpreted differently by goal. The table below shows the current logic.

Code
availability_demo <- expand.grid(
  status = c("available_with_delay", "retrospective_only", "post_outcome", "target_adjacent"),
  goal = c("prediction", "explanation_confirmation"),
  stringsAsFactors = FALSE
)
availability_demo$action <- mapply(
  gmb_env$predictor_availability_action,
  status = availability_demo$status,
  goal = availability_demo$goal
)
availability_demo$status <- vapply(availability_demo$status, gmb_env$availability_class_label, character(1))
availability_demo$goal <- ifelse(
  availability_demo$goal == "prediction",
  "Prediction",
  "Explanation / Confirmation"
)

knitr::kable(
  availability_demo,
  caption = "How the app interprets predictor availability declarations by modeling goal"
)
How the app interprets predictor availability declarations by modeling goal

status                                      goal                        action
Available only with delay                   Prediction                  warn
Retrospective only / not deployable         Prediction                  block
Known only after the outcome                Prediction                  block
Derived from the outcome / target-adjacent  Prediction                  block
Available only with delay                   Explanation / Confirmation  info
Retrospective only / not deployable         Explanation / Confirmation  info
Known only after the outcome                Explanation / Confirmation  block
Derived from the outcome / target-adjacent  Explanation / Confirmation  block

For prediction, delayed variables create a warning and retrospective or post-outcome variables create a block. For explanation, the app remains strict about post-outcome and target-adjacent variables, but it allows the user to keep retrospective variables as long as their scientific role is explicit.

This is a useful teaching distinction. A variable can be scientifically interesting in a retrospective explanatory study and still be unusable in a real deployment.
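The goal-dependent mapping itself is small enough to write down. The function below is a hypothetical re-implementation of the logic in the table, not the app's own gmb_env$predictor_availability_action:

Code
# Hypothetical re-implementation of the availability decision logic above.
# Status codes follow the demo: available_with_delay, retrospective_only,
# post_outcome, target_adjacent.
availability_action <- function(status, goal) {
  if (status %in% c("post_outcome", "target_adjacent")) return("block")
  if (goal == "prediction") {
    if (status == "available_with_delay") return("warn")
    return("block")  # retrospective_only is not deployable
  }
  "info"  # explanation/confirmation: delayed or retrospective variables stay
}

availability_action("available_with_delay", "prediction")              # "warn"
availability_action("retrospective_only", "explanation_confirmation")  # "info"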

165.4 Grouped Data Need Grouped Splits

Another leakage problem appears when rows are not independent. If the same patient, student, household, or firm contributes multiple rows, a naive random split can place one row in training and another row from the same entity in testing. The result is an overly optimistic validation score because the model has effectively seen part of the test unit already.

The app therefore includes an optional Group / entity variable in the Data step. Once it is set, repeated holdout keeps all rows from the same group in the same split.

In the toy data below, the grouping variable is patient. Each patient appears twice, so the rows are not independent observations from unrelated people. If one row from P3 were used for training and the second row from P3 were used for testing, the test result would no longer be a clean check on a genuinely unseen patient.

The following toy example uses the same split helper as the app.

Code
group_demo <- data.frame(
  patient = rep(paste0("P", 1:8), each = 2),
  outcome = factor(rep(c("neg", "pos"), each = 8)),
  marker = c(2.1, 2.3, 2.8, 3.0, 1.9, 2.0, 3.2, 3.3, 2.4, 2.5, 3.4, 3.6, 1.7, 1.8, 3.1, 3.0)
)

knitr::kable(
  data.frame(
    row_id = seq_len(nrow(group_demo)),
    group_demo
  ),
  caption = "Toy classification data with patient as the grouping variable"
)
Toy classification data with patient as the grouping variable

row_id  patient  outcome  marker
     1  P1       neg         2.1
     2  P1       neg         2.3
     3  P2       neg         2.8
     4  P2       neg         3.0
     5  P3       neg         1.9
     6  P3       neg         2.0
     7  P4       neg         3.2
     8  P4       neg         3.3
     9  P5       pos         2.4
    10  P5       pos         2.5
    11  P6       pos         3.4
    12  P6       pos         3.6
    13  P7       pos         1.7
    14  P7       pos         1.8
    15  P8       pos         3.1
    16  P8       pos         3.0
Code
group_split <- gmb_env$split_grouped_classification_indices(
  data = group_demo,
  target = "outcome",
  group_variable = "patient",
  train_fraction = 0.75,
  seed = 123
)

group_row_assignment <- data.frame(
  row_id = seq_len(nrow(group_demo)),
  patient = group_demo$patient,
  outcome = group_demo$outcome,
  marker = group_demo$marker,
  split = ifelse(seq_len(nrow(group_demo)) %in% group_split$train_idx, "train", "test")
)

knitr::kable(
  group_row_assignment,
  caption = "Grouped holdout assignment: once patient is declared as the grouping variable, both rows for each patient stay together"
)
Grouped holdout assignment: once patient is declared as the grouping variable, both rows for each patient stay together

row_id  patient  outcome  marker  split
     1  P1       neg         2.1  test
     2  P1       neg         2.3  test
     3  P2       neg         2.8  train
     4  P2       neg         3.0  train
     5  P3       neg         1.9  train
     6  P3       neg         2.0  train
     7  P4       neg         3.2  train
     8  P4       neg         3.3  train
     9  P5       pos         2.4  test
    10  P5       pos         2.5  test
    11  P6       pos         3.4  train
    12  P6       pos         3.6  train
    13  P7       pos         1.7  train
    14  P7       pos         1.8  train
    15  P8       pos         3.1  train
    16  P8       pos         3.0  train

The practical lesson is simple: if the unit of prediction is a patient, then the split must respect patient boundaries too.
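In base R the same idea takes only a few lines: sample patients, not rows. This is a minimal sketch using the group_demo data from above; unlike the app's split helper, it pays no attention to the outcome when drawing the groups.

Code
# Grouped holdout sketch: draw GROUPS at random, then assign all rows of a
# drawn group to the training split.
set.seed(123)
all_groups   <- unique(group_demo$patient)
train_groups <- sample(all_groups, size = floor(0.75 * length(all_groups)))
split_label  <- ifelse(group_demo$patient %in% train_groups, "train", "test")

# Sanity check: no patient appears in both splits.
table(group_demo$patient, split_label)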

165.5 Fold-Safe Target Encoding

Target encoding replaces each level of a categorical predictor with the mean target value for that level. It is useful when a categorical predictor contains many levels and a purely dummy-based treatment becomes awkward or unstable. But it is also a classic place where leakage can be introduced accidentally.

Before looking at the leakage issue, it helps to compare target encoding with the more familiar dummy approach.

165.5.1 Dummy Variables Versus Target Encoding

Suppose a categorical predictor Manufacturer has three levels: Acura, Cadillac, and Geo.

  • With dummy coding (also called one-hot encoding), the model receives one indicator column per level, or one fewer if a reference category is omitted.
  • With target encoding, the model receives one numeric column whose value is the mean target for that category.

The following toy example shows the difference row by row.

Code
encoding_toy <- data.frame(
  Manufacturer = factor(c("Acura", "Cadillac", "Geo", "Acura", "Geo", "Cadillac")),
  Price = c(15, 32, 10, 16, 11, 30)
)

toy_dummies <- data.frame(model.matrix(~ Manufacturer - 1, data = encoding_toy))
toy_target_map <- tapply(encoding_toy$Price, encoding_toy$Manufacturer, mean)
toy_target_encoded <- unname(toy_target_map[as.character(encoding_toy$Manufacturer)])

toy_compare <- cbind(
  encoding_toy,
  toy_dummies,
  TargetEncodedManufacturer = toy_target_encoded
)

knitr::kable(
  toy_compare,
  digits = 2,
  caption = "One-hot encoding and target encoding for the same categorical predictor"
)
One-hot encoding and target encoding for the same categorical predictor

Manufacturer  Price  ManufacturerAcura  ManufacturerCadillac  ManufacturerGeo  TargetEncodedManufacturer
Acura            15                  1                     0                0                       15.5
Cadillac         32                  0                     1                0                       31.0
Geo              10                  0                     0                1                       10.5
Acura            16                  1                     0                0                       15.5
Geo              11                  0                     0                1                       10.5
Cadillac         30                  0                     1                0                       31.0

Dummy coding is usually easier to explain because every level keeps its own column. Target encoding becomes attractive when the categorical predictor has many levels and a dummy-based treatment would create a wide and unstable design matrix. The cost is that the target itself now participates in the encoding, so leakage control becomes critical.

The unsafe version is easy to describe:

  1. compute the mean target value for each category on the full dataset,
  2. replace the category by that mean,
  3. validate the model.

That procedure leaks target information from the validation rows back into the encoded predictor.

The app therefore uses training-only target encoding inside each validation split.

165.5.2 A Concrete Encoding Illustration with Cars93 Manufacturers

Cars93 also contains a Manufacturer variable whose levels can be encoded as a target-aware summary for predictive price modeling. The crucial point is that the summary must be computed from the training rows only.

In other words:

  • dummy coding asks only, “Which manufacturer is this row?”
  • target encoding asks, “What was the average target value for this manufacturer in the training rows?”

That second question is exactly why leakage can occur if the validation rows are allowed to influence the encoding.

Code
set.seed(123)

cars_te <- subset(Cars93, select = c(Price, Manufacturer))
idx <- sample(seq_len(nrow(cars_te)), size = floor(0.8 * nrow(cars_te)))

train_te <- cars_te[idx, ]
test_te <- cars_te[-idx, ]

global_mean <- mean(train_te$Price, na.rm = TRUE)
full_map <- tapply(cars_te$Price, cars_te$Manufacturer, mean, na.rm = TRUE)
train_map <- tapply(train_te$Price, train_te$Manufacturer, mean, na.rm = TRUE)

test_manufacturer <- as.character(test_te$Manufacturer)
encoded_compare <- data.frame(
  Manufacturer = test_manufacturer,
  FullDataEncoding = unname(full_map[test_manufacturer]),
  TrainOnlyEncoding = unname(train_map[test_manufacturer]),
  stringsAsFactors = FALSE
)
encoded_compare$TrainOnlyEncoding[is.na(encoded_compare$TrainOnlyEncoding)] <- global_mean
encoded_compare$Difference <- encoded_compare$FullDataEncoding - encoded_compare$TrainOnlyEncoding

knitr::kable(
  head(encoded_compare[order(-abs(encoded_compare$Difference)), ], 10),
  digits = 3,
  caption = "Full-data target encoding versus training-only target encoding on the test rows"
)
Full-data target encoding versus training-only target encoding on the test rows

Manufacturer  FullDataEncoding  TrainOnlyEncoding  Difference
Cadillac                37.400             19.500      17.900
Cadillac                37.400             19.500      17.900
Acura                   24.900             19.500       5.400
Acura                   24.900             19.500       5.400
Subaru                  12.933              9.650       3.283
Geo                     10.450              8.400       2.050
Toyota                  17.275             15.467       1.808
Volkswagen              18.025             16.267       1.758
Nissan                  17.025             18.767      -1.742
Mazda                   17.600             19.100      -1.500

The column FullDataEncoding is the unsafe version. The column TrainOnlyEncoding is the fold-safe version that the app uses during validation.

The difference is not cosmetic. If the encoding is built on the full dataset, the model has already been allowed to see the average target behavior of the rows it is supposed to predict later.

If a category appears in the test rows but not in the training rows, the app does not borrow information from the validation data. It falls back to the training-set global mean instead.

165.6 Huber Robust Regression

The app includes Huber robust regression as a standard regression option. This matters when ordinary least squares (as used in Chapter 135) is being driven too strongly by a small number of extreme target values.

Ordinary least squares minimizes the sum of squared residuals, which means a single extreme observation can pull the fitted line substantially. Huber regression uses a modified loss function: residuals within a threshold are squared as usual, but residuals beyond that threshold are penalized linearly instead of quadratically. The result is that extreme observations receive less influence over the fit.
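In symbols, with residual \(r\) (scaled by a robust estimate of \(\sigma\)) and tuning constant \(k\), the Huber loss is

\[
\rho_k(r) =
\begin{cases}
\tfrac{1}{2}\,r^2 & \text{if } |r| \le k, \\
k\,|r| - \tfrac{1}{2}\,k^2 & \text{if } |r| > k.
\end{cases}
\]

MASS::rlm uses \(k = 1.345\) by default, a value chosen so that the estimator keeps roughly 95% efficiency when the errors are exactly normal.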

This does not replace the descriptive discussion of winsorization in the central tendency material. It addresses a different question:

If the target contains influential extremes, should the workflow compare a model whose fitting criterion is itself more robust?

165.6.1 Coefficient Comparison in the Cars93 Example

The Cars93 robust-regression session uses a compact explanatory model with:

  • Horsepower
  • Fuel.tank.capacity
  • EngineSize
  • AirBags

The following code compares ordinary least squares and Huber regression on that same formula.

Code
cars93_model_data <- subset(
  Cars93,
  select = c(Price, Horsepower, Fuel.tank.capacity, EngineSize, AirBags)
)

ols_fit <- lm(
  Price ~ Horsepower + Fuel.tank.capacity + EngineSize + AirBags,
  data = cars93_model_data
)

huber_fit <- MASS::rlm(
  Price ~ Horsepower + Fuel.tank.capacity + EngineSize + AirBags,
  data = cars93_model_data
)

coef_compare <- data.frame(
  Term = names(coef(ols_fit)),
  OLS = unname(coef(ols_fit)),
  Huber = unname(coef(huber_fit))
)
coef_compare$OLS <- round(coef_compare$OLS, 4)
coef_compare$Huber <- round(coef_compare$Huber, 4)

knitr::kable(
  coef_compare,
  caption = "Ordinary least squares and Huber coefficient estimates for the Cars93 example"
)
Ordinary least squares and Huber coefficient estimates for the Cars93 example

Term                     OLS     Huber
(Intercept)           1.2838   -1.1604
Horsepower            0.1103    0.1103
Fuel.tank.capacity    0.4921    0.5377
EngineSize           -0.6504   -0.9022
AirBagsDriver only   -3.3891   -1.7059
AirBagsNone          -6.9470   -4.5975

The app also reports how much of the sample received reduced weight under the Huber fit.

Code
huber_weights <- gmb_env$huber_weight_summary(huber_fit)

knitr::kable(
  data.frame(
    quantity = c("Observations", "Downweighted", "Share downweighted", "Minimum weight", "Median weight"),
    value = c(
      huber_weights$n,
      huber_weights$downweighted,
      round(huber_weights$share, 3),
      round(huber_weights$min_weight, 3),
      round(huber_weights$median_weight, 3)
    )
  ),
  caption = "How strongly the Huber fit downweights observations in the Cars93 example"
)
How strongly the Huber fit downweights observations in the Cars93 example

quantity            value
Observations           93
Downweighted           22
Share downweighted  0.237
Minimum weight      0.128
Median weight       1.000

Huber regression is not a wholesale replacement for ordinary least squares. The point is that when the audit has already signaled target outliers, a robust fitting criterion deserves a place in the comparison. The weights give you a direct way to see that this is not abstract theory: some observations really do count less in the robust fit.

That is why the app suggests Huber regression as a candidate when the target-outlier warning is active.
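You can verify the weight summary directly, because MASS::rlm stores the final IRLS weights on the fitted object. A minimal check, assuming the huber_fit object from above (the app's summary may apply a slightly different tolerance):

Code
# Final IRLS weights from the Huber fit; weights below 1 mark observations
# whose residuals exceeded the Huber threshold.
w <- huber_fit$w
c(
  n            = length(w),
  downweighted = sum(w < 1 - 1e-6),  # tolerance for numerical noise
  share        = round(mean(w < 1 - 1e-6), 3),
  min_weight   = round(min(w), 3)
)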

165.7 How These Guardrails Fit the Workflow

These guardrails belong together because each protects a different part of the workflow:

  • Leakage protection
    Main risk: target-adjacent predictors make the model look stronger than it really is.
    App response: hide or block suspect predictors before fitting.
  • Availability metadata
    Main risk: a variable is scientifically interesting but not really available at the decision cutoff.
    App response: warn or block based on goal and declared status.
  • Grouped splitting
    Main risk: train/test rows from the same entity leak information across the split.
    App response: keep all rows from the same group in the same split.
  • Fold-safe target encoding
    Main risk: preprocessing itself leaks target information into validation.
    App response: compute encodings inside each training split.
  • Huber regression
    Main risk: a few extreme target values dominate ordinary least squares.
    App response: compare a robust regression criterion instead of only changing the data.

This chapter therefore extends Chapter 163 and Chapter 164. The earlier chapters show how the workflow is used. This chapter shows how the workflow protects itself against some of its own most dangerous failure modes.

165.8 Practical Exercises

  1. Open the Cars93 guardrails session and record the exact audit block that fires before any model is fitted.
  2. Pick one predictor from a dataset you know well and decide whether it is immediately available, delayed, retrospective-only, or post-outcome. Explain why.
  3. Reproduce the target-encoding comparison above and explain why the full-data encoding is not acceptable for validation.
  4. Compare the lm and Huber coefficients for the Cars93 formula. Which coefficients move the most?
  5. In your own words, explain the difference between a preprocessing guardrail, a split-design guardrail, and a model-fitting guardrail.