Table of contents

  • 143.1 On the Relationship between Linear Regression Models and ANOVA
  • 143.2 Testing Regression Model Assumptions
  • 143.3 Testing Regression Coefficients
  • 143.4 Misspecification in Regression
  • 143.5 Multicollinearity in Regression
    • 143.5.1 Theoretical Aspects of Multicollinearity
    • 143.5.2 Detection of Multicollinearity
    • 143.5.3 Remedies for Multicollinearity
  • 143.6 How to Code Multiple Regression in R

143  Hypothesis Testing with Linear Regression Models (from a Practical Point of View)

143.1 On the Relationship between Linear Regression Models and ANOVA

Let us reconsider the analysis that was presented in the discussion of the Two Way ANOVA model in Chapter 128. In that example we investigated the effect of two factors, i.e. Treatment with levels “A” and “B” and Gender with levels “M” for males and “F” for females. The data were obtained within the framework of a well-designed, randomized experiment, which implies that the underlying regression assumptions are most likely satisfied.

The MLRM for the Two Way ANOVA dataset can be formulated as follows:

\[ R_i = \alpha + \beta_1 T_i + \beta_2 G_i + \gamma T_i G_i + e_i \]

where \(i\) is the index of the test subject (for \(i=1, 2, …, N\)), \(\alpha, \beta_1, \beta_2,\) and \(\gamma\) are regression parameters which need to be estimated by OLS, \(T_i\) is the variable Treatment (coded as a binary variable with \(T_i = 1\) when Treatment is equal to category “A”), \(G_i\) is Gender (coded as a binary variable with \(G_i = 1\) for males), and \(e_i\) is the prediction error of the model. The product \(T_iG_i\) is called the interaction between \(T_i\) and \(G_i\) and simply represents the situation where a male participant receives Treatment = “A” (because \(T_i G_i = 1\) if \(T_i = G_i = 1\)).
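The dummy and interaction coding described above can be made concrete with a short sketch. Python is used here for illustration (the book's own modules are written in R), and the four subjects below are hypothetical, not taken from the chapter's dataset:

```python
import numpy as np

# Hypothetical subjects (not the chapter's data), just to show the coding
treatment = np.array(["A", "B", "A", "B"])
gender = np.array(["M", "F", "F", "M"])

T = (treatment == "A").astype(float)  # T_i = 1 when Treatment is "A"
G = (gender == "M").astype(float)     # G_i = 1 for males
TG = T * G                            # interaction: 1 only for males with "A"

# Design matrix: intercept, T, G, and the interaction column
X = np.column_stack([np.ones(len(T)), T, G, TG])
```

Each row of `X` corresponds to one subject; the interaction column equals 1 exactly when both dummies are 1.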

The main goal of the analysis presented below is to find out how MLRM differs from ANOVA. At the same time, we will discuss various Hypothesis Tests that can be performed based on the model parameters.

The OLS algorithm estimates the model parameters (\(\alpha, \beta_1, \beta_2,\) and \(\gamma\)) in such a way that the sum of squared residuals, i.e. \(\sum_{i=1}^{N} e_i^2\), is minimized. The analysis shows the regression model parameters based on the OLS estimation procedure. The variables have been given text-based names: \(T_i\) is labeled “AorB”, \(G_i\) is named “MorF”, and \(T_i G_i\) is labeled “Interaction”.

Interactive Shiny app (click to load).

These results imply that the average response is \(R_i = 3.05\) for all female participants receiving Treatment = “B” (because when \(T_i = G_i = 0\) the regression equation reduces to \(R_i = 3.05 + e_i\)). Hence, the constant or intercept of the regression equation represents the baseline against which every other situation is compared. Had we coded \(T_i\) or \(G_i\) differently, we would see a different value for the intercept. Hence, it is always important to know how the variables are coded in order to interpret the results of the MLRM.

The average response for females receiving treatment “A” (instead of “B”) can be easily computed as follows:

\[ \begin{aligned}R_i &= 3.05 - 1.6 \times 1 - 2.13875 \times 0 + 1.24875 \times 0 + e_i \\&= 3.05 - 1.6 + e_i \\&= 1.45 + e_i\end{aligned} \]

On average \(e_i = 0\) (this is an underlying assumption); hence the difference in response between females receiving “B” and females receiving “A” is \(3.05 - 1.45 = 1.6\), which is exactly the result that was presented in the output of the Two Way ANOVA analysis in Chapter 128.

Would it be possible to obtain the Two Way ANOVA effect which corresponds to “B:M-A:M”? To find out, we need to compute the difference between two responses:

  • The response of males receiving “B” is \(R_i - e_i = 3.05 - 1.6 \times 0 - 2.13875 \times 1 + 1.24875 \times 0 = 0.91125\)
  • The response of males receiving “A” is \(R_i - e_i = 3.05 - 1.6 \times 1 - 2.13875 \times 1 + 1.24875 \times 1 = 0.56\)

Hence, the treatment effect for males is equal to \(0.91125 - 0.56 = 0.35125\) which is, again, exactly what we found in the ANOVA analysis.

Would it be possible to obtain the Two Way ANOVA effect which corresponds to “B:M-B:F”? To find out, we need to compute the difference between:

  • The response of males receiving “B” is \(R_i - e_i = 3.05 - 1.6 \times 0 - 2.13875 \times 1 + 1.24875 \times 0 = 0.91125\)
  • The response of females receiving “B” is \(R_i - e_i = 3.05 - 1.6 \times 0 - 2.13875 \times 0 + 1.24875 \times 0 = 3.05\)

Hence, the gender effect of treatment “B” is equal to \(0.91125 - 3.05 = -2.13875\) which is, yet again, exactly the result from the ANOVA output.

Similarly, we could replicate each of the interactions between Treatment and Gender which was presented in the Two Way ANOVA table. How about the main effects of Treatment and Gender?

To compute the Treatment effect “B-A” we have to use the following four responses:

  • The “B” response for females is \(R_i - e_i = 3.05 -1.6 \times 0 - 2.13875 \times 0 + 1.24875 \times 0 = 3.05\)
  • The “B” response for males is \(R_i - e_i = 3.05 -1.6 \times 0 - 2.13875 \times 1 + 1.24875 \times 0 = 0.91125\)
  • The “A” response for females is \(R_i - e_i = 3.05 -1.6 \times 1 - 2.13875 \times 0 + 1.24875 \times 0 = 1.45\)
  • The “A” response for males is \(R_i - e_i = 3.05 -1.6 \times 1 - 2.13875 \times 1 + 1.24875 \times 1 = 0.56\)

Now we can compute the average “B” response as \((3.05 + 0.91125) / 2 = 1.980625\) because there are exactly as many females as males. Similarly, the average “A” response is \((1.45 + 0.56) / 2 = 1.005\), which can be subtracted from the average “B” response to obtain “B-A”, i.e. \(1.980625 - 1.005 = 0.975625\). Again, we find exactly the same result as in the Two Way ANOVA.

Of course, it is not a coincidence that both methods yield exactly the same results. Moreover, it is always possible to rewrite the regression equation in such a way that the main effects and interactions of the ANOVA results are obtained. The Regression and ANOVA models are, mathematically speaking, equivalent. Sometimes ANOVA is easier to use (e.g. when the factors or exogenous variables are categorical) and in other situations (e.g. when the exogenous variables are continuous) the regression models are more practical.
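The arithmetic above is easily reproduced in a few lines. The sketch below (Python, for illustration) plugs the OLS estimates reported in the chapter into the regression equation and recovers every ANOVA effect discussed so far:

```python
# OLS estimates reported in the chapter's output
alpha, b1, b2, gamma = 3.05, -1.6, -2.13875, 1.24875

def group_mean(T, G):
    """Predicted mean response (R_i - e_i) for a given coding of T and G."""
    return alpha + b1 * T + b2 * G + gamma * T * G

BF, AF = group_mean(0, 0), group_mean(1, 0)  # females receiving "B", "A"
BM, AM = group_mean(0, 1), group_mean(1, 1)  # males receiving "B", "A"

treat_f = BF - AF                        # 1.6: treatment effect for females
treat_m = BM - AM                        # 0.35125: the "B:M-A:M" effect
gender_b = BM - BF                       # -2.13875: the "B:M-B:F" effect
main_ba = (BF + BM) / 2 - (AF + AM) / 2  # 0.975625: main effect "B-A"
```

The four derived quantities match the Two Way ANOVA output exactly, as the text argues they must.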

Note that the first Figure in the R module’s output shows the actual responses (solid line) and the predicted averages (\(R_i - e_i\)) for each group (dots).

143.2 Testing Regression Model Assumptions

The differences between the predicted levels and actual responses are called residuals or prediction errors (\(e_i\)). These residuals are visualized in the Figure shown below and can be used to assess the underlying assumptions of the MLRM.

Interactive Shiny app (click to load).

For instance, we can observe that the residuals fluctuate (more or less evenly) around zero, i.e. the model does not systematically under- or overestimate the actual values, implying that the mean error is probably close to zero. Even though the variances of the groups are not equal (this is called heteroskedasticity), we already know that the effect of unequal variances can mostly be neglected when the groups have equal size. Furthermore, we can examine the distribution of the residuals, as is shown in the following analysis:

Interactive Shiny app (click to load).

The residuals are fairly well represented by a Normal Distribution (i.e. there seems to be no serious deviation from normality). This visual check supports, but does not prove, the normal-error assumption used for exact small-sample t- and F-inference. Independence and homoskedasticity are separate assumptions that should also be assessed.

Of course, it is possible to use hypothesis testing to assess whether the data provide evidence consistent with the assumptions that the residual mean is zero, the variances are equal, and the residual distribution is approximately normal. As discussed in previous chapters, there are many procedures that can be helpful in this respect.

There are a couple of important statistics which are commonly reported in regression analysis and have the following interpretation:

  • The “Multiple R” statistic is simply the Pearson Correlation between the prediction of the model and the endogenous/dependent variable.
  • The “R-squared” is the square of the “Multiple R” statistic -- this is also called the “Determination Coefficient” and can be interpreted as the percentage of the variability of the endogenous variable that is explained by the MLRM. Obviously, the R-squared is a value between 0 and 1.
  • The “Adjusted R-squared” is an alternative measure for the “R-squared” which is corrected for the “Degrees of Freedom” of the model. It is useful for comparing OLS models with the same dependent variable, but it should be interpreted together with diagnostics and out-of-sample performance.
  • The F-test value and associated p-value are identical to the F-test of the corresponding ANOVA analysis. This can also be interpreted as the Hypothesis Test about R-squared: H\(_0: R^2 = 0\) versus H\(_A: R^2 > 0\). If the null is not rejected, we do not have sufficient evidence that the model explains variance beyond an intercept-only model.
  • The “Residual Standard Deviation” is simply the standard deviation of the prediction errors or, in other words, the expected prediction error that will occur when a single prediction is made based on the MLRM.
  • The “Sum of Squared Residuals” is the criterion which is minimized in the OLS procedure.
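To show how these statistics relate to one another, here is a self-contained sketch on simulated data (Python for illustration; the data-generating process and all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.8, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS estimates
fitted = X @ b
resid = y - fitted
k = X.shape[1]                               # number of parameters

ssr = resid @ resid                          # Sum of Squared Residuals
multiple_r = np.corrcoef(fitted, y)[0, 1]    # "Multiple R"
r2 = multiple_r ** 2                         # determination coefficient
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)    # degrees-of-freedom corrected
resid_sd = np.sqrt(ssr / (n - k))            # residual standard deviation
```

With an intercept in the model, `r2` also equals one minus the ratio of `ssr` to the total sum of squares, which ties the listed statistics together.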

To test whether or not the model satisfies the equal variances assumption there are several types of so-called Heteroskedasticity Tests available. The R module in RFC features the Goldfeld-Quandt test (Goldfeld and Quandt 1965) which assumes that:

  • the observations are ordered group-wise (if any groups exist)
  • or the observations are ordered in chronological order (in case we are dealing with time series)

In our example, the observations are ordered group-wise as can be clearly observed in the first Figure of the regression output.

The Hypothesis of the Goldfeld-Quandt test, applied to the residuals \(e_i\), can be formulated as follows:

\[ \begin{cases}\text{H}_0: \sigma_{i \leq j}^2 = \sigma_{i > j}^2 \\\text{H}_A: \sigma_{i \leq j}^2 \neq \sigma_{i > j}^2\end{cases} \]

where \(j\) is the so-called breakpoint which divides the data in two parts, \(\sigma_{i \leq j}^2\) is the Variance of the first section, and \(\sigma_{i > j}^2\) is the Variance of the second section of the dataset under investigation.
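A minimal version of this variance-ratio idea can be sketched as follows. This is a simplification for illustration (Python; the function name is ours): the full Goldfeld-Quandt procedure re-estimates the regression on each subsample, and variants drop central observations, whereas here we only compare the two segments of a given, ordered residual vector:

```python
import numpy as np

def goldfeld_quandt_F(e, j, k):
    """Simplified Goldfeld-Quandt style statistic on an ordered residual
    vector e: ratio of residual variances after vs. before breakpoint j,
    with k estimated regression parameters."""
    e1, e2 = e[:j], e[j:]
    s1 = (e1 @ e1) / (len(e1) - k)   # variance estimate, first section
    s2 = (e2 @ e2) / (len(e2) - k)   # variance estimate, second section
    return s2 / s1  # refer to an F(len(e2)-k, len(e1)-k) distribution
```

A value far from 1 (in either tail) suggests that the two sections do not share the same variance.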

Since many (sequential) Hypothesis Tests are performed here, each of which has its own type I error, multiplicity should be handled with a pre-specified primary test or a formal correction (e.g. Holm, Bonferroni, or FDR). The output Table labeled “Meta Analysis of Goldfeld-Quandt test for Heteroskedasticity” is descriptive and should not be used as a standalone decision rule.

143.3 Testing Regression Coefficients

The regression output lists all coefficients of the model, along with their statistics. This allows us to test each individual regression parameter, i.e. \(\alpha, \beta_1, \beta_2,\) and \(\gamma\) from the theoretical model formulation.

The Intercept \(\hat{\alpha} = 3.05\) has a standard error of \(0.1\), which implies that the t-Test statistic for the intercept

\[ t_{\alpha} = \frac{\hat{\alpha} - \alpha_0}{\hat{\sigma}_{\alpha}} = \frac{3.05 - 0}{0.1} = 30.5 \]

can be used to test the hypothesis

\[ \begin{cases}\text{H}_0: \alpha = \alpha_0 = 0 \\\text{H}_A: \alpha \neq \alpha_0 = 0\end{cases} \]

The corresponding one-sided and two-sided p-values are very small, which means that we reject the Null Hypothesis (implying that the intercept is not zero).

The treatment effect \(\hat{\beta}_1 = -1.6\) has a standard error of \(0.1415\), which implies that the t-Test statistic for this parameter

\[ t_{\beta_1} = \frac{\hat{\beta}_1 - \beta_{1,0}}{\hat{\sigma}_{\beta_1}} = \frac{-1.6 - 0}{0.1415} \simeq -11.3 \]

can be used to test the hypothesis

\[ \begin{cases}\text{H}_0: \beta_1 = \beta_{1,0} = 0 \\\text{H}_A: \beta_1 \neq \beta_{1,0} = 0\end{cases} \]

The corresponding one-sided and two-sided p-values are very small, which means that we reject the Null Hypothesis (implying that \(\beta_1\) is not zero).

Similarly, we can draw the conclusion that \(\beta_2\) and \(\gamma\) are significantly different from zero (both have extremely small p-values).

The overall conclusion from the MLRM is that the main effects of the treatment and gender are both significant. In addition, the combined effect is significant as well, implying that gender interacts with the effect of the treatment (the treatment effect is not equal for males and females).

Note that in regression analysis the t-Test about the parameters, by convention, assumes that the Null value is equal to zero, i.e. \(\alpha_0 = \beta_{1,0} = \beta_{2,0} = \gamma_0 = 0\). Since this is not always appropriate, it is fairly easy to build 95% confidence intervals based on the rough approximation that the lower and upper bounds are two standard errors away from the parameter estimates. For instance, the approximate 95% interval of the estimated interaction parameter \(\hat{\gamma}\) is \([1.249 - 2 \times 0.2,\; 1.249 + 2 \times 0.2] = [0.849,\, 1.649]\). With these confidence intervals it is easy to test the parameter against any Null value of interest.
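These calculations are easily scripted. The sketch below (Python, for illustration) takes the estimates and standard errors reported in this section (using the full estimate \(1.24875\) rather than the rounded \(1.249\)) and computes the t statistics and rough two-standard-error intervals:

```python
# Estimates and standard errors reported in the chapter's output
coefs = {
    "Intercept":   (3.05,    0.1),
    "AorB":        (-1.6,    0.1415),
    "Interaction": (1.24875, 0.2),
}

for name, (b, se) in coefs.items():
    t = b / se                        # t statistic for H0: parameter = 0
    lo, hi = b - 2 * se, b + 2 * se   # rough 95% confidence interval
    print(f"{name}: t = {t:.2f}, approx. 95% CI = [{lo:.4f}, {hi:.4f}]")
```

For the interaction parameter this reproduces the interval \([0.849, 1.649]\) stated above.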

143.4 Misspecification in Regression

In regression models it is almost always assumed that theoretical insights provide a “true” specification of the stochastic equation to be estimated. If the (economic) theory is incomplete, the estimated parameters will (almost certainly) be biased. The bias that is introduced by “unobserved” (or forgotten) variables is called the “unobserved variables bias” (UVB), also known as omitted variable bias.

Formally the UVB for a MLRM

\[ y = X\beta + e \]

can be illustrated, assuming the presence of unobserved variables \(Z\) with a corresponding parameter vector \(c\), by rewriting the OLS parameter equation as follows

\[ \begin{align*}b &= (X'X)^{-1}X'y \\&= (X'X)^{-1}X'(X\beta + Z c + e) \\&= \beta + (X'X)^{-1}X'Zc + (X'X)^{-1}X'e\end{align*} \]

where \((X'X)^{-1}X'Zc\) are the so-called auxiliary regressions of the omitted variables on the exogenous variables \(X\) of the original (i.e. biased) equation.

It is also interesting to see that the residual variance is biased because

\[ \hat{\sigma}^2 = \frac{y'(I - X(X'X)^{-1}X')y}{n-k} \]

or on using the true model \(y = X\beta + Zc + e\)

\[ \begin{align*}\text{if } \Theta &\equiv (I - X(X'X)^{-1}X') \\&\Downarrow \\\hat{\sigma}^2 &= \frac{(X\beta + Zc + e)'\Theta(X\beta + Zc + e)}{n-k} \\&= \frac{\left(\left[\begin{matrix} X & Z \end{matrix}\right] \left[\begin{matrix} \beta \\ c \end{matrix}\right] + e \right)'\Theta\left(\left[\begin{matrix} X & Z \end{matrix}\right] \left[\begin{matrix} \beta \\ c \end{matrix}\right] + e \right)}{n-k} \\&= \frac{1}{n-k} \left( \left[ \begin{matrix} \beta \\ c \end{matrix} \right]' \left[ \begin{matrix} X & Z \end{matrix} \right]' \Theta \left[ \begin{matrix} X & Z \end{matrix} \right] \left[ \begin{matrix} \beta \\ c \end{matrix} \right] + e'\Theta e + 2 \left[ \begin{matrix} \beta \\ c \end{matrix} \right]' \left[ \begin{matrix} X & Z \end{matrix} \right]' \Theta e \right)\end{align*} \]

and since

\[ \text{E}(e'\Theta e) = \operatorname{tr}\!\left(\text{E}(\Theta ee')\right) = \operatorname{tr}\!\left(\Theta\,\text{E}(ee')\right) = \sigma^2 \operatorname{tr}(\Theta) = (n-k)\sigma^2 \]

(because \(\Theta\) is an idempotent matrix with \(\operatorname{tr}(\Theta) = n - k\)). Since we also assume that \(\text{E}(e) = 0\), the cross term vanishes in expectation, and it follows that

\[ \text{E} (\hat{\sigma}^2) = \sigma^2 + \frac{\left[ \begin{matrix} \beta \\ c \end{matrix} \right]' \left[ \begin{matrix} X & Z \end{matrix} \right]' \Theta \left[ \begin{matrix} X & Z \end{matrix} \right] \left[ \begin{matrix} \beta \\ c \end{matrix} \right]}{n-k} \]

such that E\((\hat{\sigma}^2) \geq \sigma^2\) (Q.E.D.).
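Both consequences, the slope bias and the inflated residual variance, can be checked numerically. The simulation below is a Python sketch with a hypothetical data-generating process: a relevant variable \(z\) (correlated with \(x\)) is omitted from the fitted equation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(scale=0.6, size=n)       # omitted variable, correlated with x
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)  # true model (sigma^2 = 1)

X = np.column_stack([np.ones(n), x])              # misspecified model: z omitted
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 2)                      # biased estimate of sigma^2

# b[1] drifts towards 2.0 + 1.5 * 0.8 = 3.2 (the auxiliary regression of z
# on x has slope 0.8), and s2 exceeds the true error variance of 1.0
```

The size of the slope bias is exactly the omitted coefficient times the auxiliary-regression slope, as the matrix derivation above predicts.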

A special kind of misspecification (i.e. one that is introduced by misspecified functional forms of the relationships between the dependent and independent variables) can (sometimes) be detected with the so-called Ramsey RESET specification test (Ramsey 1969). This test is based on a test equation in which the original specification is expanded with powers of the fitted values (powers of 2, 3, etc.). The number of terms to be added is user-defined and the corresponding parameters can be tested with an ordinary F-test. If the Null Hypothesis of this F-test is rejected, it is concluded that the added terms contribute to the prediction of the endogenous variable, which automatically implies that the original equation is most likely misspecified.

The implementation of the Ramsey RESET test in RFC uses three test equations, each of which is based on different terms: powers (2 and 3) of the fitted values, powers (2 and 3) of the actual regressors, and powers (2 and 3) of the principal components. For each of these test equations the F-test is computed.
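A minimal version of the first test equation (powers of the fitted values) can be sketched as follows. This is an illustrative Python implementation, not the RFC module's code; the resulting statistic would be referred to an F distribution with the stated degrees of freedom:

```python
import numpy as np

def reset_F(y, X, powers=(2, 3)):
    """RESET-style F statistic: augment X with powers of the fitted values
    and compare restricted vs. augmented sums of squared residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    ssr0 = np.sum((y - fitted) ** 2)                 # restricted model

    Xa = np.column_stack([X] + [fitted ** p for p in powers])
    ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    ssr1 = np.sum((y - Xa @ ba) ** 2)                # augmented model

    q, df = len(powers), len(y) - Xa.shape[1]
    return ((ssr0 - ssr1) / q) / (ssr1 / df)         # ~ F(q, df) under H0
```

A large statistic indicates that the added powers improve the fit, i.e. that the original functional form is likely misspecified.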

143.5 Multicollinearity in Regression

143.5.1 Theoretical Aspects of Multicollinearity

As explained in Section 135.1.8, the parameters from estimating repeated SLRMs are identical to the results of a single MLRM where all parameters are computed simultaneously, provided the exogenous variables are orthogonal (independent). However, when two or more exogenous variables are mutually correlated, they are not independent (which is commonly referred to as multicollinearity).

Consider the simple case of two vectors \(x_1\) and \(x_2\) which are taken from the set of \(n\) exogenous variables. These vectors are called multicollinear if

\[ \exists\, c_1, c_2 \in \mathbb{R},\ (c_1, c_2) \neq (0, 0): \quad c_1 x_1 + c_2 x_2 = 0 \]

which means that both vectors are perfectly correlated (linearly dependent).

In this case it is not possible to make a distinction between the effects of \(x_1\) and \(x_2\) on the endogenous variable because they are both indistinguishable (we call this perfect multicollinearity). Mathematically speaking, the OLS parameters cannot be computed because the OLS estimator \(b = (X'X)^{-1} X'y\) contains a singular matrix \((X'X)\) which cannot be inverted. In this sense, it is a strict requirement that the \(X'X\) matrix is non-singular.
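A two-column toy example (Python, for illustration) makes the singularity concrete:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                      # perfect multicollinearity: x2 = 2 * x1
X = np.column_stack([x1, x2])
A = X.T @ X                        # the X'X matrix

# A has rank 1 instead of 2, so it is singular (determinant zero) and cannot
# be inverted; the OLS estimator b = (X'X)^{-1} X'y is not computable
rank = np.linalg.matrix_rank(A)
```

Any attempt to invert `A` here fails, which is the numerical face of perfect multicollinearity.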

Now we consider the (more realistic) case of exogenous variables which are mutually correlated (but without being perfectly, linearly dependent). Denote \(X'X\) as a square \(K \times K\) matrix \(A\) and consider an orthogonal \(K \times K\) matrix \(G\) with columns \(g_1, g_2, …, g_K\). The columns \(g_i\) are assumed to be the eigenvectors (corresponding to different latent roots) of the square matrix \(A\).

Since \(GG' = I\) we can write the MLRM as

\[ y = (XG)(G'\beta) + e = Z \delta + e \]

where the columns of \(Z\) are called “principal components” which are (by definition) orthogonal with respect to each other and have a squared length equal to the corresponding eigenvalue (since \(Z'Z = \Lambda\)).

Since \(X'X\) (and hence \(Z'Z\)) is symmetric and positive semi-definite, it follows that the eigenvalues are real and non-negative. In addition, it can be easily verified that

\[ Z'Z = G'AG = \Lambda = \left[\begin{matrix}\lambda_1 & 0 & 0 & … & 0 \\0 & \lambda_2 & 0 & … & 0 \\0 & 0 & \lambda_3 & … & 0 \\… & … & … & … & … \\0 & 0 & 0 & … & \lambda_K\end{matrix}\right] \]

Now it is possible to write the OLS estimator for the principal components parameter vector

\[ \hat{\delta} = (Z'Z)^{-1} Z' y = \Lambda^{-1} Z'y \]

and, more importantly, for the parameter covariance matrix

\[ \text{E}\left[\left( \hat{\delta} - \text{E}\left( \hat{\delta} \right) \right) \left( \hat{\delta} - \text{E}\left( \hat{\delta} \right) \right)'\right] = \sigma^2 \Lambda^{-1} \Rightarrow \text{V}\left( \hat{\delta}_j \right) = \frac{\sigma^2}{\lambda_j} \]

from which it can be deduced that the variance of \(\hat{\delta}_j\) becomes large when the eigenvalue is small (smaller than 1). In the extreme case where the eigenvalue converges to zero, the variance becomes extremely large (tending towards infinity in the case of perfect multicollinearity).

How does this result affect the variance of the parameters in the original MLRM? To answer this question, we need to consider the fact that

\[ G'\hat{\beta} = \hat{\delta} \Rightarrow \hat{\beta} = G \hat{\delta} \]

and

\[ \begin{align*}G'X'XG &= \Lambda \\X'X &= G \Lambda G' \\(X'X)^{-1} &= (G \Lambda G')^{-1} \\(X'X)^{-1} &= (G')^{-1} (G \Lambda)^{-1}\\(X'X)^{-1} &= G \Lambda^{-1} G'\end{align*} \]

Now the covariance matrix of the original parameters can be written as

\[ \text{E} \left[ (b - \beta)(b - \beta)' \right] = \sigma^2 G \Lambda^{-1} G' \Rightarrow \text{V} (b_i) = \sigma^2 \sum_{j=1}^{K} \frac{g_{ij}^2}{\lambda_j} \]

which does not necessarily imply that low lambda values yield high variances (this also depends on the other parameters in the regression). Therefore, in general, low lambda values do not always imply that the variances are inflated (but it is possible).
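The identity \(\text{V}(b_i) = \sigma^2 \sum_{j} g_{ij}^2 / \lambda_j\) can be verified numerically. The sketch below (Python; the nearly collinear design is hypothetical) compares the diagonal of \(\sigma^2 (X'X)^{-1}\) with the eigenvalue-based expression:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

A = X.T @ X
lam, G = np.linalg.eigh(A)                   # eigenvalues, orthonormal eigenvectors

sigma2 = 1.0                                 # take sigma^2 = 1 for the comparison
var_direct = sigma2 * np.diag(np.linalg.inv(A))
var_spectral = sigma2 * (G ** 2 / lam).sum(axis=1)  # sum_j g_ij^2 / lambda_j
# the small eigenvalue created by the collinear pair inflates both slope variances
```

The two vectors agree, and the small eigenvalue contributed by the nearly collinear pair dominates the variances of the corresponding slope parameters.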

143.5.2 Detection of Multicollinearity

From the previous section we can conclude that strong linear associations between exogenous variables are not always catastrophic. Hence the proposed multicollinearity diagnostics should be used with care.

143.5.2.1 Correlation Matrix

The correlation matrix can be used as a preliminary screening tool for linear associations among exogenous variables. However, pairwise correlations alone are not a reliable diagnostic for harmful multicollinearity; use established diagnostics such as VIF, condition indices/eigenvalues, and variance-decomposition proportions for inference.

143.5.2.2 High R-squared and low t-Statistics

When the determination coefficient (\(R^2\)) is reasonably high while all or most regression parameters have low t-Statistics (i.e. they are not significantly different from zero), this may be viewed as a symptom of harmful multicollinearity. As explained before, multicollinearity manifests itself by inflating the variances (which decreases the t-Statistics of the parameters) -- it does not, however, affect the ability of the model to explain the endogenous variable (hence the \(R^2\) might still be reasonably high).

143.5.2.3 High R-squared and low partial correlation between an exogenous variable and the dependent variable

Similarly, a low partial correlation between an exogenous variable and the endogenous variable is symptomatic of multicollinearity. In particular, if the partial correlation is much lower than the ordinary Pearson correlation, it can be concluded that the exogenous variable does not contribute much beyond what is already explained by the other variables.

143.5.2.4 Variance Inflation Factors

Multicollinearity may exist when the exogenous variables are linearly related to one another. This can be measured by the R-squared of the auxiliary regression that explains any chosen exogenous variable with the remaining \(K - 1\) variables (including a constant term). The higher the R-squared of such an auxiliary regression, the higher the degree of multicollinearity.

The corresponding F-statistic may be calculated as follows:

\[ F_i = \frac{\frac{R_{x_i=f(x_{j \neq i})}^2}{K-2}}{\frac{1 - R_{x_i=f(x_{j \neq i})}^2}{N-K+1}} \]

The same information can be expressed in terms of the so-called Variance Inflation Factors:

\[ \text{VIF}_i = \frac{1}{(1-R_i^2)} \]

where \(R_i^2\) is the determination coefficient of the auxiliary regression which explains \(x_i\). As the name indicates, VIF measures the factor by which the parameter’s variance is multiplied (inflated) as compared to an orthogonal regression (where all exogenous variables are independent).
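The relationship between the auxiliary regression and the VIF is easy to verify in R; the data below are simulated and all variable names are hypothetical:

```r
# Simulated (hypothetical) data: x3 depends linearly on x1 and x2
set.seed(7)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + 0.9 * x2 + rnorm(n, sd = 0.3)

# R-squared of the auxiliary regression of x3 on the other regressors
r2_aux <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_aux)
vif_x3   # substantially larger than 1, signalling inflation

# Cross-check against car::vif on a full model (if car is installed)
y <- x1 + x2 + x3 + rnorm(n)
if (requireNamespace('car', quietly = TRUE)) {
  print(car::vif(lm(y ~ x1 + x2 + x3)))
}
```

A common rule of thumb treats VIF values above 10 as cause for concern, but as argued earlier such thresholds should not be applied mechanically.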

143.5.3 Remedies for Multicollinearity

143.5.3.1 Drop spurious exogenous variables

Suppose we were interested in the estimation of the model

\[ Y_t = \alpha + \beta_G G_t + \beta_I I_t + \beta_H H_t + \beta_L L_t + \beta_A A_t + e_t \]

where \(G_t, I_t, H_t, L_t\), and \(A_t\) are exogenous variables.

In addition, suppose that harmful multicollinearity had been discovered within two groups of variables: the first three (\(G_t, I_t\), and \(H_t\)) and the last two (\(L_t\) and \(A_t\)). In this scenario we may select one representative of each group (e.g. \(G_t\) and \(L_t\)) and drop all the other exogenous variables (because they do not contain any information that is not already present in either \(G_t\) or \(L_t\)).

143.5.3.2 Principal Components

As explained before, \(X'X\) can be diagonalized and written in terms of eigenvectors and eigenvalues. Accordingly, the linear model can be written in terms of its principal components. The first principal component can intuitively be interpreted as a summary of all exogenous variables in a single column vector that explains as much of \(X\) as possible; the remaining information is contained in the second principal component, and so on. It is important to note that the principal components are orthogonal and therefore cannot be multicollinear.

Suppose we had computed the principal components for the model of the previous section, and assume that the principal components (PCs) account for (in descending order) 90%, 5%, 4%, … of the total variance of the exogenous variables. In these circumstances we would retain the first three PCs in our regression model since they account for 99% of the variance of \(X\).

Having three PCs in the regression model means that there are three important groups of variables (within \(X\)) which explain the endogenous variable. Cross-correlations between the exogenous variables and the PCs should reveal which variables are associated with which factor (this is necessary for interpretation purposes).

Now suppose that this regression results in only the first PC’s parameter being significantly different from zero. In that case our model reduces to a simple regression, and we can look at the correlations between the exogenous variables and the first PC to determine which of the original variables is important. However, if more than one PC has a significant regression parameter, it is much more difficult to assess the importance of each of the original variables.

Therefore, in such circumstances, it could be better to compute the PCs separately for the two subgroups that we detected before. We may partition the \(X\) matrix as follows

\[ X = \left[ \begin{array}{ccc|cc} G_1 & I_1 & H_1 & L_1 & A_1 \\ G_2 & I_2 & H_2 & L_2 & A_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ G_T & I_T & H_T & L_T & A_T \end{array} \right] = \left[ \begin{matrix} S & T \end{matrix} \right] \]

and compute the PCs for S and T separately. This will probably result in at least one significant PC parameter per subgroup in a multiple regression, which makes the model rather easy to interpret. Note, however, that there is now no reason to assume that the first PC of S and the first PC of T are not multicollinear. More importantly, since both sets of PCs were computed separately, and since our splitting of the variables into two subgroups might have been wrong, there is no guarantee that the PCs are optimal in a global sense (for the complete regression model).
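Although the PC approach is not implemented in RFC, it is easy to experiment with locally. The following sketch uses simulated data (all variable names and parameter values are hypothetical) to mimic the two subgroups discussed above:

```r
# Simulated (hypothetical) data: G, I, H share one common factor,
# L and A share another; y depends on both factors
set.seed(99)
n  <- 120
f1 <- rnorm(n)
f2 <- rnorm(n)
G  <- f1 + rnorm(n, sd = 0.2)
I  <- f1 + rnorm(n, sd = 0.2)
H  <- f1 + rnorm(n, sd = 0.2)
L  <- f2 + rnorm(n, sd = 0.2)
A  <- f2 + rnorm(n, sd = 0.2)
X  <- cbind(G, I, H, L, A)
y  <- f1 + f2 + rnorm(n)

# Principal components of the (standardized) exogenous variables
pc <- prcomp(X, scale. = TRUE)
summary(pc)$importance['Cumulative Proportion', ]  # variance accounted for

# Regress y on the retained PCs (orthogonal, hence no multicollinearity)
fit_pc <- lm(y ~ pc$x[, 1:2])
summary(fit_pc)

# Correlations between the original variables and the PCs aid interpretation
round(cor(X, pc$x[, 1:2]), 2)
```

With two underlying factors, the first two PCs capture almost all of the variance, and the correlation table reveals which original variables load on which component.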

Due to these problems, the PC approach has not been implemented in RFC. Instead, there is an alternative approach (i.e. Partial Least Squares) which is, statistically speaking, superior because it does not suffer from these shortcomings and provides easily interpretable results at the same time.

143.6 How to Code Multiple Regression in R

Even though the Multiple Regression module can be found on the public website (https://compute.wessa.net/rwasp_multipleregression.wasp) and in RFC (under the Models / Manual Model Building menu), it is easy to reproduce the ANOVA-equivalent regression analysis locally with a compact script:

set.seed(123)

# Balanced 2x2 design (Treatment x Gender), 8 observations per cell
n_cell <- 8
AorB <- rep(c('B', 'A'), each = 2 * n_cell)
MorF <- rep(rep(c('F', 'M'), each = n_cell), times = 2)

# Cell means chosen to match the chapter discussion
mu <- ifelse(AorB == 'B' & MorF == 'F', 3.05,
      ifelse(AorB == 'A' & MorF == 'F', 1.45,
      ifelse(AorB == 'B' & MorF == 'M', 0.91125, 0.56)))

R <- mu + rnorm(length(mu), mean = 0, sd = 0.25)
df <- data.frame(
  R = R,
  AorB = factor(AorB, levels = c('B', 'A')),
  MorF = factor(MorF, levels = c('F', 'M'))
)

# MLRM equivalent to two-way ANOVA with interaction
fit <- lm(R ~ AorB * MorF, data = df)
summary(fit)

# Same model in ANOVA form
anova(fit)

Call:
lm(formula = R ~ AorB * MorF, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38813 -0.14488 -0.03331  0.16449  0.37817 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.10871    0.08313  37.395  < 2e-16 ***
AorBA       -1.79249    0.11756 -15.247 4.34e-15 ***
MorFM       -2.12890    0.11756 -18.108  < 2e-16 ***
AorBA:MorFM  1.33913    0.16626   8.054 9.04e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2351 on 28 degrees of freedom
Multiple R-squared:  0.952, Adjusted R-squared:  0.9469 
F-statistic: 185.2 on 3 and 28 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: R
          Df  Sum Sq Mean Sq F value    Pr(>F)    
AorB       1 10.0876 10.0876 182.464 8.690e-14 ***
MorF       1 17.0373 17.0373 308.168 < 2.2e-16 ***
AorB:MorF  1  3.5866  3.5866  64.873 9.039e-09 ***
Residuals 28  1.5480  0.0553                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
Figure 143.1: Residual diagnostics for the ANOVA-equivalent MLRM
# Goldfeld-Quandt test (ordered by fitted values)
if (requireNamespace('lmtest', quietly = TRUE)) {
  print(lmtest::gqtest(fit, order.by = fitted(fit), alternative = 'two.sided'))
}

# Variance inflation factors
if (requireNamespace('car', quietly = TRUE)) {
  print(car::vif(fit))
}

    Goldfeld-Quandt test

data:  fit
GQ = 1.0096, df1 = 12, df2 = 12, p-value = 0.9871
alternative hypothesis: variance changes from segment 1 to 2

     AorB      MorF AorB:MorF 
        2         2         3 

This compact workflow focuses on the chapter’s core objective: linking two-way ANOVA and MLRM, then checking residual assumptions and coefficient inference.

Goldfeld, Stephen M., and Richard E. Quandt. 1965. “Some Tests for Homoscedasticity.” Journal of the American Statistical Association 60 (310): 539–47. https://doi.org/10.1080/01621459.1965.10480811.
Ramsey, J. B. 1969. “Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis.” Journal of the Royal Statistical Society: Series B (Methodological) 31 (2): 350–71. https://doi.org/10.1111/j.2517-6161.1969.tb00796.x.

  1. If the number of females and males were unequal, we would have to use a weighted mean instead.↩︎

  2. The summary statistics are shown at the bottom of the output and confirm that the Arithmetic Mean is equal to 0.00000 (rounded to 5 decimal places).↩︎

  3. Use a formal multiple-testing correction (e.g., Holm, Bonferroni, or FDR) when combining evidence from several tests.↩︎


© 2026 Patrick Wessa. Provided as-is, without warranty.

