Table of contents

  • 163.1 Open the App Full Screen
  • 163.2 The Workflow Used in This Chapter
  • 163.3 Three Workflow Controls That Matter More Than They First Appear
  • 163.4 Worked Example 1: Explanatory Regression with Cars93
    • 163.4.1 What the Handbook Session Already Sets Up
    • 163.4.2 What the Audit Teaches
    • 163.4.3 What the Strategy Step Adds
    • 163.4.4 What the Model Comparison Shows
  • 163.5 Worked Example 2: Predictive Classification with PimaIndiansDiabetes2
    • 163.5.1 What the Handbook Session Already Sets Up
    • 163.5.2 What the Fitted Comparison Teaches
    • 163.5.3 Why the Goal Matters Here
  • 163.6 Exports, Reports, and Session Logic
  • 163.7 Practical Exercises

163  Guided Model Building in Practice

The purpose of this chapter is practical: you will open the Guided Model Building app, work through concrete datasets, and see how audit, strategy, model fitting, diagnostics, and export are connected inside one scientific workflow.

The app keeps every methodological choice visible so that you can review, revise, and defend your reasoning at each stage.

163.1 Open the App Full Screen

Warning: Full-screen use

The Guided Model Building app is not well suited for the narrow handbook column. Open it in a new tab so that the data step, model comparison step, and diagnostics all remain readable.

  • Open the blank Guided Model Building app
  • Open the Cars93 handbook session
  • Open the Pima handbook session

The handbook links above do not reopen an old learner session. They load a read-only chapter template on the server and create a fresh working session from it, so you always start from the same clean state.

163.2 The Workflow Used in This Chapter

The app organizes model building into a small number of stages:

  • Start: choose processing mode and begin a session (makes retention and replay explicit)
  • Data: choose the dataset, target, goal, predictors, optional group variable, and any prediction-time availability exceptions (ensures modeling begins with a clear research question and a realistic deployment story)
  • Audit: inspect warnings before fitting (connects model building to Chapter 63)
  • Strategy: review preprocessing and candidate workflows (makes the initial path visible and challengeable)
  • Models: fit a small candidate set (emphasizes comparison over single-model commitment)
  • Diagnostics: inspect residual, ROC, calibration, or forecasting behavior (turns criticism into part of the workflow)
  • Export: download the report and R script (makes the reasoning portable and reproducible)

The most important practical distinction appears early: the user must choose between prediction and explanation / confirmation (see Section 158.2). That one decision changes how redundancy, validation, interpretability, and diagnostics are weighted.

163.3 Three Workflow Controls That Matter More Than They First Appear

The current version of the app adds three controls that are easy to overlook if you only focus on the model list:

  • Group / entity variable (Data step): keeps repeated rows from the same unit in the same split, so the model is not trained on one row from a patient and tested on another
  • Prediction-time availability exceptions (Data step): separate variables that are immediately available, delayed, retrospective-only, or post-outcome
  • Locked final test set (confirmatory tabular workflows): reserves a hidden final split for one last check after model choice and revision

These controls are not decorative. They change what counts as a scientifically defensible workflow. A model can have good coefficients or strong validation metrics and still be unacceptable if it leaks information across rows, uses a variable that would not exist at prediction time, or keeps reusing the same test evidence after many revisions.

For students, the most important practical reading rule is this:

  • use the Data, Audit, Strategy, Models, and Diagnostics steps to build and challenge the workflow,
  • use the locked final test only at the end, when you are ready for one last confirmatory check in Export.
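The effect of the group / entity variable can be sketched in a few lines of base R. This is an illustration of the splitting idea only, with hypothetical patient data, not the app's internal code:

```r
# Hypothetical data: three repeated measurements per patient
set.seed(42)
d <- data.frame(
  patient = rep(sprintf("P%02d", 1:20), each = 3),
  glucose = rnorm(60, mean = 100, sd = 15)
)

# Naive row-wise split: one patient's rows can land on both sides
naive_idx <- sample(nrow(d), 30)

# Group-aware split: sample patients, then take all of their rows
train_patients <- sample(unique(d$patient), 10)
train <- d[d$patient %in% train_patients, ]
test  <- d[!d$patient %in% train_patients, ]

# No patient appears on both sides of the group-aware split
length(intersect(train$patient, test$patient))  # 0
```

With the naive index, a patient's first measurement can end up in training while the second is held out, which silently inflates validation scores.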

163.4 Worked Example 1: Explanatory Regression with Cars93

Cars93 is a cross-sectional dataset on passenger cars sold in the early 1990s. It contains technical characteristics, size variables, equipment indicators, and price information. In this chapter, the target is Price, and the goal is not just to predict price mechanically but to study how a guided workflow reacts when explanatory regression is confronted with skewness and outlying values.

Code
library(MASS)

cars93_app <- subset(
  Cars93,
  select = c(Price, Horsepower, Fuel.tank.capacity, EngineSize, AirBags)
)

knitr::kable(
  head(cars93_app, 8),
  caption = "Variables used in the handbook session for Cars93"
)
Variables used in the handbook session for Cars93
Price Horsepower Fuel.tank.capacity EngineSize AirBags
15.9 140 13.2 1.8 None
33.9 200 18.0 3.2 Driver & Passenger
29.1 172 16.9 2.8 Driver only
37.7 172 21.1 2.8 Driver & Passenger
30.0 208 21.1 3.5 Driver only
15.7 110 16.4 2.2 Driver only
20.8 170 18.0 3.8 Driver only
23.7 180 23.0 5.7 Driver only
Code
numeric_cars93 <- cars93_app[sapply(cars93_app, is.numeric)]

knitr::kable(
  round(cor(numeric_cars93, use = "pairwise.complete.obs"), 2),
  caption = "Numeric association structure for the Cars93 example"
)
Numeric association structure for the Cars93 example
Price Horsepower Fuel.tank.capacity EngineSize
Price 1.00 0.79 0.62 0.60
Horsepower 0.79 1.00 0.71 0.73
Fuel.tank.capacity 0.62 0.71 1.00 0.76
EngineSize 0.60 0.73 0.76 1.00

163.4.1 What the Handbook Session Already Sets Up

The prepared Cars93 session opens with:

  • target: Price
  • predictors: Horsepower, Fuel.tank.capacity, EngineSize, AirBags
  • goal: Explanation / Confirmation
  • interpretability priority: High

This configuration forces the app to balance coefficient-style interpretation against target outliers and skewness, and it places the analysis in an explanatory setting: the user wants a defensible account of price differences across cars, not merely the smallest possible prediction error.

163.4.2 What the Audit Teaches

In this handbook session, the audit fires two important warnings:

  • R016_target_outlier_warning — the target contains values far from the bulk of the data (see Chapter 69 for how outliers are identified),
  • R010_positive_skew_log — the target distribution is right-skewed (see Section 67.5 for the formal definition).

At this stage the question is no longer “Which regression model do I like?” but rather:

  1. Is the target scale stable enough for ordinary least squares?
  2. Would a transform change the interpretation in a useful way?
  3. Does a robust regression path deserve comparison?

The audit output directly shapes which models and transforms are worth considering next.
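Both warnings correspond to checks you can reproduce directly on the Cars93 target. The sketch below uses the textbook skewness coefficient and the 1.5 IQR boxplot fence; the app's exact estimators and cutoffs may differ:

```r
library(MASS)  # Cars93

price <- Cars93$Price

# Right-skew check: moment-based skewness coefficient (see Section 67.5)
skew <- mean((price - mean(price))^3) / sd(price)^3
skew > 0  # TRUE: the price distribution is right-skewed

# Outlier check: the 1.5 * IQR upper-fence rule from the boxplot (Chapter 69)
q <- quantile(price, c(0.25, 0.75))
upper_fence <- unname(q[2] + 1.5 * diff(q))
sum(price > upper_fence)  # a handful of expensive cars lie above the fence
```

A positive skewness coefficient together with upper-fence outliers is exactly the combination that triggers R010 and R016 in the prepared session.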

163.4.3 What the Strategy Step Adds

For this session the app proposes a strategy that combines:

  • log transformation of the target (see Chapter 79 for the general family of power transforms),
  • rare-level pooling (combining categorical levels that appear too few times into a single __OTHER__ group, so that the model does not try to estimate a separate effect from a handful of rows),
  • winsorization of numeric predictors (clipping extreme values to the 1st and 99th percentiles of the training set, as introduced for the winsorized mean in Chapter 66),
  • target encoding where appropriate (discussed in detail in Section 165.5),
  • a compact candidate set of lm, Huber, stepwise AIC, stepwise BIC, ctree, and cforest (Chapter 142).

The current guided app keeps this first working slice intentionally compact, but it now includes regularized linear regression and regularized logistic regression as coefficient-based shrinkage candidates. Those paths tune ridge, elastic-net, and lasso style penalties on the training rows only and then report the selected penalty family, alpha, lambda.min, and lambda.1se in the model details. What the app still does not expose is a large free-form search grid across many model families. The method background for those tuning decisions is therefore treated separately in Chapter 161 and Chapter 162.

Stepwise AIC and stepwise BIC are automated variable-selection procedures. They start from the full predictor set and remove or add predictors one at a time, guided by the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). AIC tends to keep more predictors; BIC penalizes model size more heavily and usually produces a smaller formula.
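The stepwise idea can be reproduced with MASS::stepAIC on the same Cars93 variables; BIC is obtained by setting the penalty k to log(n). This is a sketch of the generic procedure, not the app's exact search configuration:

```r
library(MASS)  # Cars93 and stepAIC

d <- Cars93[, c("Price", "Horsepower", "Fuel.tank.capacity",
                "EngineSize", "AirBags")]

full <- lm(log(Price) ~ ., data = d)

# AIC uses penalty k = 2 (the default); BIC is obtained with k = log(n)
fit_aic <- stepAIC(full, trace = FALSE)
fit_bic <- stepAIC(full, trace = FALSE, k = log(nrow(d)))

# The heavier BIC penalty keeps at most as many coefficients here
length(coef(fit_bic)) <= length(coef(fit_aic))
```

Setting trace = TRUE prints the elimination path, which is instructive the first time you run it.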

Preprocessing refers to any transformation applied to the data before model fitting: rescaling, imputation, encoding, or outlier treatment. This step shows that preprocessing and feature treatment are already part of the scientific argument, not afterthoughts to be handled separately.
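Winsorization, for example, can be written as a small helper whose clip points are learned on the training rows only and then reused unchanged on new rows. A minimal sketch of that idea, using the percentiles described above:

```r
# Learn clip points on the training rows only (1st and 99th percentiles)
winsor_fit <- function(x, lower = 0.01, upper = 0.99) {
  quantile(x, c(lower, upper), na.rm = TRUE, names = FALSE)
}

# Apply the learned clip points to any rows, training or new
winsor_apply <- function(x, fences) {
  pmin(pmax(x, fences[1]), fences[2])
}

set.seed(1)
train_x <- c(rnorm(98), 25, -30)        # two artificial extreme values
fences  <- winsor_fit(train_x)
clipped <- winsor_apply(train_x, fences)

range(clipped)  # the extremes are pulled in to the learned percentiles
```

Keeping the fit and apply steps separate is what prevents information from validation rows leaking into the preprocessing.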

163.4.4 What the Model Comparison Shows

In the prepared session the default selected model is Huber robust regression (introduced in Section 165.6). The repeated-holdout summary attached to the selected path reports:

  • mean held-out RMSE: 3.4147
  • mean held-out MAE: 2.6794
  • mean held-out \(R^2\): 0.8083 (see Chapter 135 for the interpretation of \(R^2\))

Repeated holdout (see Section 160.3) means that the data are randomly split into a training set and a held-out test set multiple times, the model is fitted on each training set, and the error metrics are averaged across splits.
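The mechanics of repeated holdout can be sketched in a few lines. This illustration uses a plain lm and 20 random 80/20 splits; the app's candidate models, split counts, and seeds differ, so the numbers will not match the summary above:

```r
library(MASS)  # Cars93

d <- Cars93[, c("Price", "Horsepower", "Fuel.tank.capacity",
                "EngineSize", "AirBags")]

set.seed(123)
rmse <- replicate(20, {
  idx  <- sample(nrow(d), round(0.8 * nrow(d)))   # 80/20 split
  fit  <- lm(Price ~ ., data = d[idx, ])
  pred <- predict(fit, newdata = d[-idx, ])
  sqrt(mean((d$Price[-idx] - pred)^2))
})

c(mean = mean(rmse), sd = sd(rmse))  # average error and split-to-split spread
```

Reporting the spread alongside the mean is what distinguishes repeated holdout from a single lucky or unlucky split.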

The point is not that Huber regression is always superior. The point is that, under an explanatory goal plus target outlier warnings, a robust coefficient-based model is often more defensible than pretending that the ordinary least squares assumptions were never stressed.
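A minimal comparison of ordinary least squares and Huber robust regression can be run with MASS::rlm on the same variables; this sketch omits the app's preprocessing pipeline, so it illustrates the estimator contrast only:

```r
library(MASS)  # Cars93 and rlm

d <- Cars93[, c("Price", "Horsepower", "Fuel.tank.capacity",
                "EngineSize", "AirBags")]

fit_ols   <- lm(Price ~ ., data = d)
fit_huber <- rlm(Price ~ ., data = d)   # Huber M-estimation is the default

# OLS weighs every car equally; Huber downweights large-residual cars
w <- fit_huber$w
summary(w)   # final IWLS weights lie in (0, 1]
sum(w < 1)   # number of downweighted (outlying) cars
```

Inspecting which cars receive small weights is itself a diagnostic: they are typically the same expensive models flagged by the audit.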

This is exactly where Chapter 76 becomes relevant. Residual shape and outlier sensitivity are not decorative diagnostics. They change which model deserves to be taken seriously.

If a tree-based regression alternative becomes competitive in this kind of workflow, the next interpretive question is not only where the tree splits, but also how reliable the terminal nodes look internally. That is the specific topic of Chapter 141.

In confirmatory tabular workflows, the current app also reserves a locked final test split before model fitting begins. That split is hidden while the analyst compares candidates and tests revisions. It is revealed only at the end, when the user wants one last explicit confirmation that the selected path still holds up on untouched data.

So the workflow for this kind of session is not “fit once, then immediately read the final result.” It is:

  1. build the model on the analysis subset,
  2. revise it if necessary,
  3. commit to a final path,
  4. only then reveal the locked final test evaluation.

163.5 Worked Example 2: Predictive Classification with PimaIndiansDiabetes2

PimaIndiansDiabetes2 is a medical screening dataset in which the target records whether diabetes is present. The predictors are clinical measurements such as glucose, body mass, insulin, triceps skinfold thickness, pedigree, and age. Here the goal is explicitly predictive: the task is to classify future cases as well as possible while still keeping the workflow transparent.

Code
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("PimaIndiansDiabetes2", package = "mlbench")
  pima_app <- PimaIndiansDiabetes2

  pima_missing <- data.frame(
    variable = names(pima_app),
    missing = colSums(is.na(pima_app))
  )

  knitr::kable(
    pima_missing,
    caption = "Missing-value counts in PimaIndiansDiabetes2"
  )
} else {
  knitr::kable(
    data.frame(note = "Package 'mlbench' was not available while rendering this chapter."),
    caption = "PimaIndiansDiabetes2 availability"
  )
}
Missing-value counts in PimaIndiansDiabetes2
variable missing
pregnant 0
glucose 5
pressure 35
triceps 227
insulin 374
mass 11
pedigree 0
age 0
diabetes 0
Code
if (exists("pima_app")) {
  diabetes_share <- round(prop.table(table(pima_app$diabetes)), 3)
  knitr::kable(
    data.frame(class = names(diabetes_share), share = as.numeric(diabetes_share)),
    caption = "Outcome shares in PimaIndiansDiabetes2"
  )
}
Outcome shares in PimaIndiansDiabetes2
class share
neg 0.651
pos 0.349

163.5.1 What the Handbook Session Already Sets Up

The prepared PimaIndiansDiabetes2 session opens with:

  • target: diabetes
  • goal: Prediction
  • predictors: glucose, mass, age, insulin, triceps, pregnant, pedigree

This specification reflects a typical screening problem: several biologically plausible predictors are available, the target is binary, and missing values need to be handled before model comparison begins.

The preprocessing choices are also part of the lesson:

  • missing-value imputation is enabled (numeric columns are filled with the training-set median; categorical columns with the training-set mode),
  • numeric winsorization is enabled,
  • target encoding is disabled,
  • the app evaluates a compact classifier set: logistic regression (Chapter 136), Gaussian Naive Bayes, conditional inference trees (Chapter 140), and conditional random forests (Chapter 142).
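The imputation rule deserves a concrete sketch, because the essential point is easy to miss: the fill values are computed on the training rows and then reused, unchanged, on validation rows. A hypothetical miniature data frame makes this visible:

```r
# Hypothetical training rows with missing values
train <- data.frame(
  glucose = c(95, NA, 110, 140, NA, 88),
  group   = factor(c("a", "a", NA, "b", "a", "b"))
)

# Learn the fill values on the training rows only
fill_median <- median(train$glucose, na.rm = TRUE)    # 102.5
fill_mode   <- names(which.max(table(train$group)))   # "a"

impute <- function(d) {
  d$glucose[is.na(d$glucose)] <- fill_median
  d$group[is.na(d$group)]     <- fill_mode
  d
}

# A new row is filled with the *training* statistics, not its own
new_row <- data.frame(glucose = NA_real_,
                      group = factor(NA, levels = levels(train$group)))
impute(new_row)
```

Recomputing the median on the full dataset, validation rows included, would be a mild but real form of leakage.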

163.5.2 What the Fitted Comparison Teaches

In the prepared session the default selected model is logistic regression. Its validation summary reports:

  • mean held-out accuracy: 0.7695
  • mean held-out AUC: 0.8311

This is the right place to connect the workflow app back to the method chapters:

  • coefficient interpretation from Chapter 136,
  • classification summaries from Chapter 59,
  • discrimination from Chapter 60,
  • tree-based alternatives from Chapter 140,
  • ensemble tree benchmarks from Chapter 142.
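Discrimination, as summarized by the AUC above, has a compact rank interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A sketch with hypothetical classifier scores:

```r
set.seed(9)
score_pos <- rnorm(40, mean = 1)   # model scores for true positives
score_neg <- rnorm(60, mean = 0)   # model scores for true negatives

# AUC = P(score of a random positive > score of a random negative),
# estimated over all positive-negative pairs
auc <- mean(outer(score_pos, score_neg, ">"))
auc  # well above 0.5, since positives tend to score higher
```

This pairwise view explains why AUC is threshold-free: it depends only on the ordering of the scores, not on any cutoff.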

The app adds something new: it makes those ideas compete inside one concrete workflow rather than teaching them as isolated tools.

The diagnostics now also make a distinction that students often miss at first reading:

  • the automatic default is based on the mean repeated-validation AUC,
  • the center line in the predictive-stability boxplot is the median repeated-validation AUC,
  • the diamond shows the current held-out split rather than the repeated average.

This means a model can have the highest median but still lose on the mean if the resample distribution is skewed. The app therefore asks you to think about two questions at once:

  1. Which model is strongest on average?
  2. Which model is most reliable across resamples?

That is why the stability section now includes both a distribution plot and a mean-versus-variability plot.
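The mean-versus-median distinction is easy to reproduce with hypothetical resample AUCs: a model that is usually strong but occasionally collapses can win on the median and still lose on the mean.

```r
# Hypothetical repeated-validation AUCs for two candidate models
auc_a <- c(0.86, 0.86, 0.87, 0.87, 0.55)  # usually strong, one collapsed split
auc_b <- c(0.84, 0.84, 0.85, 0.85, 0.85)  # consistently decent

median(auc_a) > median(auc_b)  # TRUE: model A wins on the median
mean(auc_a) > mean(auc_b)      # FALSE: the bad split drags A's mean down
```

Which model you should prefer depends on whether occasional failures are tolerable in deployment, which is precisely the question the stability plots pose.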

163.5.3 Why the Goal Matters Here

In a predictive classification exercise, the scientifically relevant question is not merely:

Which model can be interpreted most elegantly?

It is also:

Which model maintains useful discrimination on unseen cases?

That is why the app keeps threshold-aware ROC diagnostics in the workflow. A logistic model can remain preferable even when a tree is more visually intuitive, provided the out-of-sample discrimination is better and the predictive goal is explicit.

The app also shows AUCPR as a complementary metric. AUC remains the automatic ranking criterion, but AUCPR becomes useful when the analyst cares especially about precision among the positive predictions and wants to judge whether a model that looks good on ROC-based discrimination is still attractive when false positives are more costly.
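Why precision-based summaries react to class imbalance while sensitivity and specificity do not can be shown with a two-line calculation (illustrative numbers, not the app's output): at fixed sensitivity and specificity, precision collapses as the positive class becomes rare.

```r
# Fixed classifier quality: 80% sensitivity, 80% specificity
sens <- 0.80
spec <- 0.80

precision_at <- function(prevalence) {
  tp <- sens * prevalence                 # true-positive probability mass
  fp <- (1 - spec) * (1 - prevalence)     # false-positive probability mass
  tp / (tp + fp)
}

precision_at(0.50)  # 0.80: balanced classes
precision_at(0.05)  # about 0.17: most flagged cases are false alarms
```

Since AUCPR averages precision over recall levels, it inherits this sensitivity to prevalence, which is exactly what makes it informative when false positives are costly.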

163.6 Exports, Reports, and Session Logic

Every serious workflow should end with files you can save, share, and reuse. The app therefore allows you to export:

  • an HTML report,
  • a reproducible R script,
  • a saved session that you can reopen later.

This is especially useful for teaching. A learner can:

  1. run the workflow,
  2. export the report,
  3. justify the model choice,
  4. return later and resume the session.

The chapter-session links extend that idea: each link opens a prepared workflow state so that you and the handbook are looking at the same starting point.

163.7 Practical Exercises

  1. Open the blank app and rebuild the Cars93 example from scratch instead of using the prepared session. Write down which audit warnings appear before you fit anything.
  2. Open the Cars93 handbook session and explain why Huber regression is preferred to ordinary least squares in this particular workflow.
  3. Open the PimaIndiansDiabetes2 handbook session and compare logistic regression with the tree model. Does the better predictive model also produce the clearest scientific explanation?
  4. In the PimaIndiansDiabetes2 session, inspect the predictive-stability plots. Which model looks strongest on average, and which looks most reliable across resamples?
  5. Export both handbook sessions and inspect the generated R scripts. Which parts are full reproductions of the chosen path, and which parts are still simplifications?

The next chapter, Chapter 164, continues from this point and focuses on the revision loop: when to test a new path, how to compare it, and when it is justified to promote it.


© 2026 Patrick Wessa. Provided as-is, without warranty.
