Table of contents

  • 136.1 Definition
  • 136.2 Why Not Linear Regression?
  • 136.3 The Logistic Function
  • 136.4 The Logit Transformation
  • 136.5 Model Specification
  • 136.6 Parameter Estimation
  • 136.7 Interpretation of Coefficients
    • 136.7.1 Log-Odds Interpretation
    • 136.7.2 Odds Ratio Interpretation
  • 136.8 Hypothesis Testing
    • 136.8.1 Wald Test
    • 136.8.2 Likelihood Ratio Test
  • 136.9 Model Fit and Diagnostics
    • 136.9.1 Deviance
    • 136.9.2 Pseudo R-squared
    • 136.9.3 Hosmer-Lemeshow Test
  • 136.10 Predictions and Classification
  • 136.11 Connection to ROC Analysis
  • 136.12 R Module
    • 136.12.1 Public website
    • 136.12.2 RFC
    • 136.12.3 R Code
    • 136.12.4 Extracting Key Results
    • 136.12.5 Predictions and ROC Curve
  • 136.13 Example: Fraud Detection
    • 136.13.1 Applying ROC Analysis
  • 136.14 Separation and Convergence Diagnostics
  • 136.15 Multiple Logistic Regression
  • 136.16 Assumptions
  • 136.17 Pros & Cons
    • 136.17.1 Pros
    • 136.17.2 Cons
  • 136.18 Task

136  Logistic Regression

136.1 Definition

Logistic regression is a statistical method for modeling the probability of a binary outcome as a function of one or more predictor variables. Unlike linear regression (Chapter 134), which predicts a continuous response, logistic regression predicts the probability that an observation belongs to one of two categories.

The model is widely used in classification problems where the outcome variable \(Y\) takes values 0 or 1 (e.g., disease/no disease, fraud/no fraud, success/failure).

136.2 Why Not Linear Regression?

Consider modeling a binary outcome \(Y \in \{0, 1\}\) using linear regression:

\[ P(Y = 1 | X) = \beta_0 + \beta_1 X \]

This approach has fundamental problems:

  • Predicted probabilities can fall outside the valid range \([0, 1]\)
  • The relationship between \(X\) and the probability is assumed to be linear, which is often unrealistic
  • The error terms cannot be normally distributed when the outcome is binary

Logistic regression addresses these issues by modeling the probability through a transformation that constrains predictions to lie between 0 and 1.
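The first problem can be seen directly in R: fitting a linear probability model to simulated binary data typically yields fitted values outside \([0, 1]\), while a logistic fit cannot. The data below are illustrative only.

```r
# Simulated binary data: linear probability model vs. logistic fit
set.seed(1)
x <- seq(-3, 3, length = 100)
y <- rbinom(100, 1, plogis(2 * x))   # true P(Y=1|x) follows a logistic curve

lpm <- lm(y ~ x)                     # linear probability model
range(fitted(lpm))                   # strays outside [0, 1]

logit_fit <- glm(y ~ x, family = binomial)
range(fitted(logit_fit))             # always within (0, 1)
```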

136.3 The Logistic Function

The logistic (sigmoid) function maps any real number to the interval \((0, 1)\):

\[ \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z} \]

z <- seq(-6, 6, length = 200)
sigma <- 1 / (1 + exp(-z))
plot(z, sigma, type = "l", lwd = 2, col = "blue",
     xlab = "z", ylab = expression(sigma(z)),
     main = "Logistic Function")
abline(h = 0.5, lty = 2, col = "gray")
abline(v = 0, lty = 2, col = "gray")
Figure 136.1: The Logistic (Sigmoid) Function

Properties of the logistic function:

  • \(\sigma(z) \in (0, 1)\) for all \(z \in \mathbb{R}\)
  • \(\sigma(0) = 0.5\)
  • \(\sigma(-z) = 1 - \sigma(z)\) (symmetry)
  • \(\lim_{z \to -\infty} \sigma(z) = 0\) and \(\lim_{z \to +\infty} \sigma(z) = 1\)
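These properties can be checked numerically with R's built-in plogis(), which implements \(\sigma(z)\):

```r
# sigma(z) is available in R as plogis()
plogis(0)                              # 0.5
all.equal(plogis(-2), 1 - plogis(2))   # symmetry: TRUE
plogis(c(-50, 50))                     # tails: approximately 0 and 1
```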

136.4 The Logit Transformation

The logit function is the inverse of the logistic function. For a probability \(p \in (0, 1)\):

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]

The quantity \(\frac{p}{1-p}\) is called the odds. If \(p = 0.75\), the odds are \(\frac{0.75}{0.25} = 3\), meaning success is three times as likely as failure.

The logit transforms probabilities from \((0, 1)\) to \((-\infty, +\infty)\), allowing us to use a linear model on this transformed scale.
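In R the logit and its inverse are available as qlogis() and plogis(); a quick check with \(p = 0.75\):

```r
p <- 0.75
qlogis(p)                         # log-odds: log(0.75/0.25) = log(3) ≈ 1.099
all.equal(qlogis(p), log(3))      # TRUE
all.equal(plogis(qlogis(p)), p)   # plogis() undoes qlogis(): TRUE
```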

136.5 Model Specification

The logistic regression model specifies that the log-odds of the outcome is a linear function of the predictors:

\[ \text{logit}(P(Y = 1 | X)) = \log\left(\frac{P(Y = 1 | X)}{1 - P(Y = 1 | X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k \]

Equivalently, the probability is:

\[ P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_k X_k)}} \]

For a single predictor:

\[ P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

136.6 Parameter Estimation

The parameters \(\beta_0, \beta_1, ..., \beta_k\) are estimated using maximum likelihood estimation. Given \(n\) independent observations \((x_i, y_i)\) where \(y_i \in \{0, 1\}\), the likelihood function is:

\[ L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \]

where \(p_i = P(Y = 1 | X = x_i)\).

The log-likelihood is:

\[ \ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]

There is no closed-form solution; the estimates are obtained through iterative numerical optimization (typically Newton-Raphson or iteratively reweighted least squares).
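The iteratively reweighted least squares update can be sketched in a few lines of R. The function and simulated data below are illustrative only (no safeguards against separation or near-zero weights); in practice glm() performs this optimization internally.

```r
# Minimal IRLS sketch for logistic regression (illustrative, not robust)
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    p <- plogis(drop(X %*% beta))          # current fitted probabilities
    W <- p * (1 - p)                       # IRLS weights
    z <- drop(X %*% beta) + (y - p) / W    # working response
    beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
    if (max(abs(beta_new - beta)) < tol) return(drop(beta_new))
    beta <- drop(beta_new)
  }
  drop(beta)
}

# Matches glm() on the same simulated data
set.seed(7)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + x))
X <- cbind(1, x)
cbind(IRLS = irls_logistic(X, y),
      glm  = coef(glm(y ~ x, family = binomial)))
```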

136.7 Interpretation of Coefficients

136.7.1 Log-Odds Interpretation

The coefficient \(\beta_j\) represents the change in the log-odds of the outcome for a one-unit increase in \(X_j\), holding other predictors constant.

136.7.2 Odds Ratio Interpretation

Exponentiating the coefficient gives the odds ratio:

\[ \text{OR}_j = e^{\beta_j} \]

The odds ratio represents the multiplicative change in the odds for a one-unit increase in \(X_j\):

  • \(\text{OR} > 1\): the predictor increases the odds of \(Y = 1\)
  • \(\text{OR} = 1\): no effect (equivalent to \(\beta = 0\))
  • \(\text{OR} < 1\): the predictor decreases the odds of \(Y = 1\)

For example, if \(\beta_1 = 0.693\), then \(\text{OR}_1 = e^{0.693} \approx 2\). A one-unit increase in \(X_1\) doubles the odds of \(Y = 1\).

136.8 Hypothesis Testing

136.8.1 Wald Test

The Wald test evaluates whether a coefficient is significantly different from zero:

\[ z = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \]

Under the null hypothesis \(H_0: \beta_j = 0\), the test statistic follows approximately a standard normal distribution for large samples.
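The Wald statistics reported by summary.glm() can be reproduced by hand; the model below is fitted to illustrative simulated data.

```r
# Reproduce the Wald z statistics and p-values from a glm summary
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.8 * x))
fit <- glm(y ~ x, family = binomial)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
z   <- est / se                     # Wald statistic
p   <- 2 * pnorm(-abs(z))           # two-sided normal p-value
cbind(z, summary_z = coef(summary(fit))[, "z value"])  # identical columns
```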

136.8.2 Likelihood Ratio Test

The likelihood ratio test compares nested models:

\[ G^2 = -2 \left[ \ell(\text{reduced model}) - \ell(\text{full model}) \right] \]

Under the null hypothesis, \(G^2\) follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters.
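In R, the likelihood ratio test for nested models is available through anova() with test = "Chisq"; the models below are fitted to illustrative simulated data.

```r
# Likelihood ratio test of nested logistic models
set.seed(2)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(-0.3 + x1))

full    <- glm(y ~ x1 + x2, family = binomial)
reduced <- glm(y ~ x1, family = binomial)

# G^2 by hand: twice the difference in log-likelihoods,
# equivalently the difference in residual deviances
G2 <- as.numeric(2 * (logLik(full) - logLik(reduced)))
all.equal(G2, reduced$deviance - full$deviance)   # TRUE

anova(reduced, full, test = "Chisq")              # same G^2 with df = 1
```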

136.9 Model Fit and Diagnostics

136.9.1 Deviance

The deviance measures the goodness of fit:

\[ D = -2 \ell(\hat{\beta}) \]

Lower deviance indicates better fit. The null deviance (intercept-only model) can be compared to the residual deviance (full model) to assess the contribution of predictors.

136.9.2 Pseudo R-squared

Unlike linear regression, logistic regression does not have a true \(R^2\). Several pseudo R-squared measures exist:

McFadden’s R-squared (McFadden 1974):

\[ R^2_{\text{McFadden}} = 1 - \frac{\ell(\text{full model})}{\ell(\text{null model})} \]

Values between 0.2 and 0.4 are often considered good fit, but this is only a rough heuristic and depends on context and on which pseudo-\(R^2\) definition is used.

136.9.3 Hosmer-Lemeshow Test

This test (Hosmer and Lemeshow 2000) groups observations into deciles based on predicted probabilities and compares observed and expected frequencies using a chi-squared test.

# Optional: Hosmer-Lemeshow goodness-of-fit test
# install.packages("ResourceSelection")
library(ResourceSelection)
hoslem.test(model$y, fitted(model), g = 10)

136.10 Predictions and Classification

Logistic regression produces predicted probabilities \(\hat{p}_i = P(Y = 1 | X = x_i)\). To convert these into binary predictions, a classification threshold must be chosen, as discussed in Section 60.2.

The default threshold is often 0.5:

\[ \hat{Y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \end{cases} \]

However, the optimal threshold depends on the costs of different types of errors. The ROC curve (Chapter 60) and pay-off matrix (Section 60.5) provide tools for selecting the appropriate threshold.

136.11 Connection to ROC Analysis

The predicted probabilities from logistic regression are exactly the type of classifier output that ROC analysis evaluates. The workflow is:

  1. Fit a logistic regression model
  2. Obtain predicted probabilities for all observations
  3. Construct the ROC curve by varying the classification threshold
  4. Compute the AUC to assess overall discriminative ability (Section 60.4)
  5. Select the optimal threshold based on Youden’s index (Youden 1950) or a pay-off matrix (Section 60.5)

The Confusion Matrix (Chapter 59) and associated metrics (Sensitivity, Specificity, Precision) can then be computed at the chosen threshold.

136.12 R Module

136.12.1 Public website

Logistic Regression is available on the public website:

  • https://compute.wessa.net/rwasp_logisticregression.wasp

136.12.2 RFC

The Logistic Regression module is available in RFC under the menu “Models / Logistic Regression”.

An interactive model-building application that includes logistic regression alongside other classification methods (naive Bayes, conditional inference trees) is available under “Models / Manual Model Building”. This application allows users to compare model performance using ROC curves (Chapter 60) and confusion matrices (Chapter 59), and to select optimal classification thresholds based on cost analysis (Section 60.5).

136.12.3 R Code

The following example demonstrates logistic regression using simulated data:

# Simulate data
set.seed(42)
n <- 200

# Predictors
age <- rnorm(n, mean = 50, sd = 10)
income <- rnorm(n, mean = 50000, sd = 15000)

# True relationship: log-odds depends on age and income
log_odds <- -5 + 0.05 * age + 0.00004 * income
prob <- 1 / (1 + exp(-log_odds))
outcome <- rbinom(n, 1, prob)

# Create data frame
data <- data.frame(outcome, age, income)

# Fit logistic regression
model <- glm(outcome ~ age + income, data = data, family = binomial)
summary(model)

Call:
glm(formula = outcome ~ age + income, family = binomial, data = data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.107e+00  1.124e+00  -4.544 5.52e-06 ***
age          3.489e-02  1.663e-02   2.097    0.036 *  
income       5.949e-05  1.227e-05   4.848 1.25e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 271.45  on 199  degrees of freedom
Residual deviance: 240.40  on 197  degrees of freedom
AIC: 246.4

Number of Fisher Scoring iterations: 4

136.12.4 Extracting Key Results

# Coefficients
cat("Coefficients:\n")
print(coef(model))

# Odds ratios with 95% CI
cat("\nOdds Ratios (with 95% CI):\n")
odds_ratios <- exp(cbind(OR = coef(model), confint(model)))
print(odds_ratios)

# Model deviance
cat("\nNull deviance:", model$null.deviance, "on", model$df.null, "df\n")
cat("Residual deviance:", model$deviance, "on", model$df.residual, "df\n")

# McFadden's R-squared
null_model <- glm(outcome ~ 1, data = data, family = binomial)
mcfadden_r2 <- 1 - (logLik(model) / logLik(null_model))
cat("\nMcFadden's R-squared:", as.numeric(mcfadden_r2), "\n")
Coefficients:
  (Intercept)           age        income 
-5.106995e+00  3.488575e-02  5.948625e-05 

Odds Ratios (with 95% CI):
                     OR        2.5 %    97.5 %
(Intercept) 0.006054248 0.0006032972 0.0502592
age         1.035501391 1.0029363455 1.0708399
income      1.000059488 1.0000363879 1.0000847

Null deviance: 271.4507 on 199 df
Residual deviance: 240.3984 on 197 df

McFadden's R-squared: 0.114394 

136.12.5 Predictions and ROC Curve

# Predicted probabilities
predicted_probs <- predict(model, type = "response")

# Classification at threshold = 0.5
predicted_class <- ifelse(predicted_probs >= 0.5, 1, 0)

# Confusion matrix
confusion <- table(Predicted = predicted_class, Actual = outcome)
print(confusion)

# Accuracy
accuracy <- sum(diag(confusion)) / sum(confusion)
cat("\nAccuracy:", round(accuracy, 3), "\n")
         Actual
Predicted  0  1
        0 93 38
        1 24 45

Accuracy: 0.69 
# Compute ROC curve manually
compute_rates <- function(threshold, probs, actuals) {
  predictions <- as.integer(probs >= threshold)
  TP <- sum(predictions == 1 & actuals == 1)
  FP <- sum(predictions == 1 & actuals == 0)
  TN <- sum(predictions == 0 & actuals == 0)
  FN <- sum(predictions == 0 & actuals == 1)
  TPR <- TP / (TP + FN)
  FPR <- FP / (FP + TN)
  return(c(FPR = FPR, TPR = TPR))
}

thresholds <- seq(0, 1, by = 0.01)
roc_points <- t(sapply(thresholds, compute_rates,
                        probs = predicted_probs,
                        actuals = outcome))

# Compute AUC using trapezoidal rule
compute_auc <- function(fpr, tpr) {
  ord <- order(fpr)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

auc <- compute_auc(roc_points[, "FPR"], roc_points[, "TPR"])

# Plot ROC curve
plot(roc_points[, "FPR"], roc_points[, "TPR"],
     type = "l", lwd = 2, col = "blue",
     xlab = "False Positive Rate (1 - Specificity)",
     ylab = "True Positive Rate (Sensitivity)",
     main = paste("ROC Curve (AUC =", round(auc, 3), ")"))
abline(a = 0, b = 1, lty = 2, col = "gray")

# Mark Youden's optimal threshold
youden_idx <- which.max(roc_points[, "TPR"] - roc_points[, "FPR"])
points(roc_points[youden_idx, "FPR"], roc_points[youden_idx, "TPR"],
       pch = 19, col = "red", cex = 1.5)
legend("bottomright",
       legend = c("ROC curve", paste("Youden optimal (t =", thresholds[youden_idx], ")")),
       lty = c(1, NA), pch = c(NA, 19), col = c("blue", "red"), lwd = c(2, NA))
Figure 136.2: ROC Curve for Logistic Regression Model

136.13 Example: Fraud Detection

We apply logistic regression to the fraud detection problem introduced in Chapter 58 and Chapter 60.

# Simulated fraud data
set.seed(123)
n <- 500

# Features
transaction_amount <- rexp(n, rate = 0.01)  # Transaction amount
hour_of_day <- sample(0:23, n, replace = TRUE)  # Hour of transaction
is_foreign <- rbinom(n, 1, 0.2)  # Foreign transaction indicator

# True fraud probability depends on features
log_odds_fraud <- -4 + 0.0005 * transaction_amount +
                   0.1 * (hour_of_day < 6 | hour_of_day > 22) +
                   1.5 * is_foreign
prob_fraud <- 1 / (1 + exp(-log_odds_fraud))
is_fraud <- rbinom(n, 1, prob_fraud)

fraud_data <- data.frame(is_fraud, transaction_amount, hour_of_day, is_foreign)

cat("Fraud rate:", mean(is_fraud), "\n\n")

# Fit model
# Note: hour_of_day is generated above but excluded here intentionally to keep
# the first specification focused on two core predictors.
fraud_model <- glm(is_fraud ~ transaction_amount + is_foreign,
                   data = fraud_data, family = binomial)
summary(fraud_model)
Fraud rate: 0.048 


Call:
glm(formula = is_fraud ~ transaction_amount + is_foreign, family = binomial, 
    data = fraud_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -4.012939   0.396641 -10.117  < 2e-16 ***
transaction_amount  0.003877   0.001748   2.218  0.02657 *  
is_foreign          1.581474   0.429283   3.684  0.00023 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 192.58  on 499  degrees of freedom
Residual deviance: 175.26  on 497  degrees of freedom
AIC: 181.26

Number of Fisher Scoring iterations: 6
# Odds ratios
cat("\nOdds Ratios:\n")
exp(coef(fraud_model))

# Interpretation:
# - Each one-unit increase in transaction amount multiplies the odds of
#   fraud by exp(0.00388) ≈ 1.004 (about 1.47 per 100 units)
# - Foreign transactions have about 4.9 times the odds of being fraudulent

Odds Ratios:
       (Intercept) transaction_amount         is_foreign 
        0.01808017         1.00388420         4.86211852 

136.13.1 Applying ROC Analysis

# Predicted probabilities
fraud_probs <- predict(fraud_model, type = "response")

# ROC curve
roc_fraud <- t(sapply(thresholds, compute_rates,
                       probs = fraud_probs,
                       actuals = is_fraud))
auc_fraud <- compute_auc(roc_fraud[, "FPR"], roc_fraud[, "TPR"])

# Cost-optimal threshold (missed fraud costs 50x false alarm)
cost_FP <- 1
cost_FN <- 50
prevalence <- mean(is_fraud)

expected_costs <- cost_FP * roc_fraud[, "FPR"] * (1 - prevalence) +
                  cost_FN * (1 - roc_fraud[, "TPR"]) * prevalence
cost_optimal_idx <- which.min(expected_costs)
cost_optimal_threshold <- thresholds[cost_optimal_idx]

# Youden optimal
youden_idx <- which.max(roc_fraud[, "TPR"] - roc_fraud[, "FPR"])
youden_threshold <- thresholds[youden_idx]

# Plot
plot(roc_fraud[, "FPR"], roc_fraud[, "TPR"],
     type = "l", lwd = 2, col = "blue",
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = paste("Fraud Detection ROC (AUC =", round(auc_fraud, 3), ")"))
abline(a = 0, b = 1, lty = 2, col = "gray")
points(roc_fraud[youden_idx, "FPR"], roc_fraud[youden_idx, "TPR"],
       pch = 19, col = "green", cex = 1.5)
points(roc_fraud[cost_optimal_idx, "FPR"], roc_fraud[cost_optimal_idx, "TPR"],
       pch = 17, col = "red", cex = 1.5)
legend("bottomright",
       legend = c(paste("Youden (t =", youden_threshold, ")"),
                  paste("Cost-optimal (t =", cost_optimal_threshold, ")")),
       pch = c(19, 17), col = c("green", "red"))

cat("Youden optimal threshold:", youden_threshold, "\n")
Youden optimal threshold: 0.08 
cat("Cost-optimal threshold:", cost_optimal_threshold, "\n")
Cost-optimal threshold: 0 
Figure 136.3: ROC Curve for Fraud Detection Model

The cost-optimal threshold is lower than Youden’s threshold because missing a fraud (false negative) is much more costly than a false alarm (false positive). This is consistent with the analysis in Chapter 60.

136.14 Separation and Convergence Diagnostics

In practice, logistic regression can fail because of complete or quasi-complete separation (predictors perfectly classify the outcome). Symptoms include very large coefficient estimates, huge standard errors, and convergence warnings.

Recommended workflow:

  1. Check for warnings from glm() (non-convergence, fitted probabilities near 0 or 1).
  2. Inspect sparse cells and near-perfect rules in contingency tables.
  3. If separation is present, use penalized methods (e.g., Firth logistic regression (Firth 1993; Heinze and Schemper 2002)) or simplify the model.
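A minimal demonstration of complete separation, using toy data in which the predictor perfectly splits the outcome:

```r
# Complete separation: x < 0 always gives y = 0, x > 0 always gives y = 1
x <- c(-2, -1.5, -1, 1, 1.5, 2)
y <- c(0, 0, 0, 1, 1, 1)

fit <- suppressWarnings(glm(y ~ x, family = binomial))
coef(fit)              # slope diverges toward infinity
sqrt(diag(vcov(fit)))  # enormous standard errors

# If the logistf package is installed, Firth's penalized fit is finite:
# logistf::logistf(y ~ x)
```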

136.15 Multiple Logistic Regression

With multiple predictors, the interpretation extends naturally:

\[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k \]

Each coefficient \(\beta_j\) represents the change in log-odds for a one-unit increase in \(X_j\), holding all other predictors constant. This is analogous to the interpretation in multiple linear regression (Chapter 135).

Interaction terms and polynomial terms can be included:

# Interaction between amount and foreign transaction
model_interaction <- glm(is_fraud ~ transaction_amount * is_foreign,
                         data = fraud_data, family = binomial)

# Polynomial term for non-linear effect
model_poly <- glm(is_fraud ~ poly(transaction_amount, 2) + is_foreign,
                  data = fraud_data, family = binomial)

136.16 Assumptions

Logistic regression makes the following assumptions:

  1. Binary outcome: The dependent variable must be binary (or binomial counts)
  2. Independence: Observations are independent of each other
  3. Linearity in log-odds: The relationship between predictors and log-odds is linear
  4. No perfect multicollinearity (identification): Predictors cannot be exact linear combinations of each other
  5. Large sample size: Maximum likelihood estimation requires adequate sample size (rule of thumb: at least 10 events per predictor)

High (imperfect) multicollinearity is a practical estimation concern because it inflates standard errors and can destabilize coefficient estimates.

Unlike linear regression, logistic regression does not assume:

  • Normality of residuals
  • Homoscedasticity (constant variance)
  • Linear relationship between predictors and the outcome

136.17 Pros & Cons

136.17.1 Pros

Logistic regression has the following advantages:

  • Produces interpretable coefficients as odds ratios.
  • Outputs probabilities that can be used with ROC analysis for threshold optimization.
  • Does not require normally distributed predictors.
  • Handles both continuous and categorical predictors.
  • Well-established statistical inference (confidence intervals, hypothesis tests).
  • Computationally efficient even for large datasets.

136.17.2 Cons

Logistic regression has the following disadvantages:

  • Assumes a linear relationship between predictors and log-odds, which may not hold.
  • Sensitive to outliers in the predictor variables.
  • Requires a relatively large sample size, especially when the outcome is rare.
  • Cannot directly handle missing data.
  • May underperform compared to more flexible methods (e.g., random forests, neural networks) when relationships are highly non-linear.

136.18 Task

  1. Using the fraud detection data or a dataset of your choice, fit a logistic regression model. Interpret the coefficients as odds ratios.

  2. Construct the ROC curve for your fitted model and compute the AUC. How does the model’s discriminative ability compare to the AUC interpretation guidelines in Table 60.2?

  3. Define a pay-off matrix appropriate for your application. What is the cost-optimal threshold, and how does it differ from the default threshold of 0.5?

  4. Compare the predictions from logistic regression at the cost-optimal threshold with predictions at the Youden-optimal threshold. Compute the Confusion Matrix (Chapter 59) for both and discuss the trade-offs.

Firth, David. 1993. “Bias Reduction of Maximum Likelihood Estimates.” Biometrika 80 (1): 27–38.
Heinze, Georg, and Michael Schemper. 2002. “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine 21 (16): 2409–19.
Hosmer, David W., and Stanley Lemeshow. 2000. Applied Logistic Regression. 2nd ed. New York: John Wiley & Sons.
McFadden, Daniel. 1974. “Conditional Logit Analysis of Qualitative Choice Behavior.” In Frontiers in Econometrics, edited by Paul Zarembka, 105–42. New York: Academic Press.
Youden, W. J. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35. https://doi.org/10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.

© 2026 Patrick Wessa. Provided as-is, without warranty.
