119 Unpaired Two Sample Welch Test

119.1 Hypotheses

As explained in the discussion of the (ordinary) Unpaired Two Sample t-Test, the Variances are usually assumed to be unknown but equal. This is not always a realistic assumption, so we need to address the case (i.e. case #4) where \(\sigma_1^2 \neq \sigma_2^2\).

The theoretical treatment of this case and the corresponding method is commonly referred to as the “Welch Test” (Welch 1947), even though it is not always featured in textbooks or statistical software packages.

For practical purposes, there is rarely a reason to prefer the ordinary t-Test (case #3) over the Welch Test (case #4). Even if both Variances are equal (\(\sigma_1^2 = \sigma_2^2\)), the Welch Test remains valid and performs very similarly to the ordinary t-Test.

Decision Threshold Choice

Role of Welch’s test: usually confirmatory (main mean-comparison claim).
Threshold choice: choose and justify the confirmatory significance level for the mean comparison (often 1% to 5% in confirmatory work).
Variance F-test: when Welch’s test is used for the main comparison, there is usually no reason to report the equal-variance F-test. An exception may be a pedagogical context where the goal is to illustrate why Welch’s procedure is preferred to the classical equal-variance t-test.
Reporting: include the mean-difference estimate, confidence interval, and an effect size (not only the p-value).

This chapter fits the broader decision-threshold framework explained in Chapter 112.
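The reporting recommendation above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the names `g1`, `g2`, and `report` are hypothetical, and the standardized effect size uses the square root of the average of the two sample variances as standardizer, which is only one of several possible choices.

```r
# Synthetic data (assumption: not the data used elsewhere in this chapter)
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 2)
g2 <- rnorm(25, mean = 12, sd = 4)

alpha <- 0.05  # confirmatory significance level, chosen and justified in advance
res <- t.test(g1, g2, var.equal = FALSE, conf.level = 1 - alpha)

# Mean-difference estimate (first sample minus second sample)
mean_diff <- unname(res$estimate[1] - res$estimate[2])

# Confidence interval for the mean difference
ci <- as.numeric(res$conf.int)

# Standardized effect size (assumption: average-variance standardizer)
d <- mean_diff / sqrt((var(g1) + var(g2)) / 2)

report <- c(difference = mean_diff, lower = ci[1], upper = ci[2],
            effect_size = d, p_value = res$p.value)
print(round(report, 4))
```

Collecting these quantities in one object makes it harder to report the p-value in isolation.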

119.2 Analysis based on p-values

Consider the analysis that was presented for the ordinary Unpaired Two Sample t-Test. We only need to consider the case of the two-sided Hypothesis Test to illustrate the Welch Test (the one-sided tests can be interpreted in similar ways).

The analysis shown below is a copy of the example shown in the previous section. It displays the results from the ordinary t-Test and contains information about the ratio of sample variances, which is used to test whether the ratio of population variances is equal to one. At the 95% confidence level, the Null Hypothesis (which states that both variances are equal) is not rejected: the corresponding p-value is 0.2336, which is much larger than common type I error levels.

Interactive Shiny app (click to load).

If the equal-variance assumption is doubtful, use the Welch test directly. In modern practice, Welch’s procedure is often preferred by default for unpaired comparisons because it remains valid under unequal variances and performs very similarly when variances are equal.
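The claim that Welch's procedure remains valid under unequal variances can be checked with a small Monte Carlo experiment. The sketch below is only an illustration: the sample sizes, standard deviations, number of replications, and the helper `rejection_rates()` are assumptions chosen for this example, not part of the original analysis. Both samples are drawn with the same mean, so every rejection is a type I error.

```r
# Hypothetical helper: empirical rejection rates of both tests when H0 is true
rejection_rates <- function(n1, sd1, n2, sd2, n_rep = 2000, alpha = 0.05) {
  reject <- replicate(n_rep, {
    y1 <- rnorm(n1, mean = 0, sd = sd1)
    y2 <- rnorm(n2, mean = 0, sd = sd2)  # same mean: the Null Hypothesis holds
    c(pooled = t.test(y1, y2, var.equal = TRUE)$p.value < alpha,
      welch  = t.test(y1, y2, var.equal = FALSE)$p.value < alpha)
  })
  rowMeans(reject)  # proportion of replications in which each test rejects
}

set.seed(2024)
# Small sample with large spread versus large sample with small spread
unequal <- rejection_rates(n1 = 10, sd1 = 5, n2 = 40, sd2 = 1)
# Same sample sizes, but equal variances
equal <- rejection_rates(n1 = 10, sd1 = 1, n2 = 40, sd2 = 1)

print(rbind(unequal, equal))
```

In the unequal-variance scenario the ordinary t-Test should reject far more often than the nominal 5%, while the Welch Test should stay close to it. In the equal-variance scenario both rates should be near 5%.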

We now have to change the setting for “Type of test to use” to “Two Sample t Test (unequal variance)” which causes the analysis to be recomputed.

Both the confidence interval and the p-value are different from those of the “ordinary” test. In this case, the difference is rather small (almost negligible), which is a consequence of the fact that the variances are close. As soon as the variances deviate from each other, the difference between the p-values of the ordinary t-Test and the Welch Test may become much larger (the same applies to the confidence intervals).

The Welch test statistic is

\[ t = \frac{(\bar{x}_1-\bar{x}_2)-\mu_0}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \]

with approximate degrees of freedom (Welch 1947; Satterthwaite 1946)

\[ \nu = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}}. \]
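The two formulas above can be evaluated by hand and compared with the output of `t.test()`. The following sketch assumes synthetic uniform data and a two-sided test with \(\mu_0 = 0\); the variable names `v1`, `v2`, `t_stat`, and `nu` are chosen for this illustration only.

```r
# Synthetic data (assumption: uniform draws, 15 observations per sample)
set.seed(123)
A <- runif(15, 1, 7)
B <- runif(15, 1, 7)
mu0 <- 0  # hypothesized difference in means

# Per-sample variance of the mean: s_i^2 / n_i
v1 <- var(A) / length(A)
v2 <- var(B) / length(B)

# Welch test statistic
t_stat <- (mean(A) - mean(B) - mu0) / sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom
nu <- (v1 + v2)^2 / (v1^2 / (length(A) - 1) + v2^2 / (length(B) - 1))

# Two-sided p-value from the t distribution with nu degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = nu)

# Reference result from the built-in function
res <- t.test(A, B, var.equal = FALSE, mu = mu0)

print(c(t = t_stat, df = nu, p = p_value))
print(c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value))
```

Both printed rows should agree up to floating-point precision. Note that the degrees of freedom are generally not an integer.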

In this example, both procedures lead to the same conclusion. This, however, will not necessarily be the case when the variances are unequal.

To compute the Welch Test on your local machine, the following script can be used (for wide format data) in the R console.

Note: this local script is a synthetic template. The embedded app example above uses the Pima.tr dataset and therefore has different numeric output.

set.seed(123)
A <- runif(15, 1, 7)
B <- runif(15, 1, 7)
x <- cbind(A, B)
par1 <- 1            # column number of first sample
par2 <- 2            # column number of second sample
par3 <- 0.95         # confidence level (= 1 - alpha)
par4 <- 'two.sided'  # alternative: 'two.sided', 'less', or 'greater'
par5 <- 'unpaired'   # 'unpaired' or 'paired'
par6 <- 0.0          # difference in means under the Null Hypothesis
paired <- (par5 != 'unpaired')
# var.equal=FALSE requests the Welch Test
(t.test(x[,par1], x[,par2], var.equal=FALSE, alternative=par4, paired=paired, mu=par6, conf.level=par3))


    Welch Two Sample t-test

data:  x[, par1] and x[, par2]
t = -0.049545, df = 27.944, p-value = 0.9608
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.362979  1.298609
sample estimates:
mean of x mean of y 
 4.418309  4.450493

The code can also be written for long format data as follows:

x = data.frame(measurement = c(A, B), group = c(rep("A", 15), rep("B", 15)))
par3 = 0.95 #confidence (= 1 - alpha)
par4 = 'two.sided'
# par5 = 'unpaired'
par6 = 0.0 #Null Hypothesis
# if (par5 == 'unpaired') paired <- FALSE else paired <- TRUE
(t.test(measurement ~ group, var.equal=FALSE, alternative=par4, mu=par6, conf.level=par3, data = x))


    Welch Two Sample t-test

data:  measurement by group
t = -0.049545, df = 27.944, p-value = 0.9608
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -1.362979  1.298609
sample estimates:
mean in group A mean in group B 
       4.418309        4.450493
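As noted at the start of this section, the one-sided tests can be interpreted in a similar way. A minimal sketch, assuming the same kind of synthetic uniform data, shows how the `alternative` argument of `t.test()` switches between the three variants and how the resulting p-values relate to each other.

```r
# Synthetic data (assumption: uniform draws, 15 observations per sample)
set.seed(123)
A <- runif(15, 1, 7)
B <- runif(15, 1, 7)

# Welch Test for each of the three alternatives
p_two     <- t.test(A, B, var.equal = FALSE, alternative = "two.sided")$p.value
p_less    <- t.test(A, B, var.equal = FALSE, alternative = "less")$p.value
p_greater <- t.test(A, B, var.equal = FALSE, alternative = "greater")$p.value

print(c(two.sided = p_two, less = p_less, greater = p_greater))

# The two one-sided p-values sum to one, and the smaller of them
# equals half of the two-sided p-value
print(all.equal(p_less + p_greater, 1))
print(all.equal(2 * min(p_less, p_greater), p_two))
```

The test statistic and the degrees of freedom are identical in all three calls; only the tail area used for the p-value (and the form of the confidence interval) changes.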

119.3 Assumptions

The assumptions of this test are similar to those explained in Section 118.3, except that Welch’s test does not require equal population variances.

119.4 Alternatives

The alternatives to this test are explained in Section 118.4.