Table of contents

  • 121.1 Hypotheses
    • 121.1.1 Classical model
    • 121.1.2 Randomization model
  • 121.2 Analysis based on p-values
    • 121.2.1 KS Test for distributions
    • 121.2.2 KS Test for distributional shapes
  • 121.3 Assumptions
  • 121.4 Alternatives

121  Mann-Whitney U test (Wilcoxon Rank-Sum Test)

The Wilcoxon Rank-Sum Test is not the same as the Wilcoxon Signed-Rank Test described in Chapter 117. The latter is used exclusively for paired/dependent samples, as an alternative to the Paired Two Sample t-Test.

The Wilcoxon Rank-Sum Test (Wilcoxon 1945) (also commonly referred to as the Mann-Whitney or Wilcoxon-Mann-Whitney U Test (Mann and Whitney 1947)) is used as an alternative to the Unpaired Two Sample t-Test and the Welch Test. The main advantage of both types of Wilcoxon tests is that they are non-parametric, which implies that there is no need to make the usual assumptions associated with the Central Limit Theorem.

121.1 Hypotheses

The Hypotheses tested by the Wilcoxon Rank-Sum Test depend on the type of inference that is performed: the classical population model, or the randomization model, which is mostly used in medical research.

121.1.1 Classical model

In this setting we test whether the two populations have the same distribution (Null Hypothesis) or not (Alternative Hypothesis). The Wilcoxon Rank-Sum Test is used to test whether one distribution is “shifted” by a constant amount when compared to the other. When the Null Hypothesis is not rejected then there is no evidence of a location shift between the distributions. When the Null Hypothesis is rejected, one distribution may be shifted to the right or left in comparison to the other population.

Important: This location-shift interpretation is only valid when both populations have the same distributional shape (i.e., identical variance, skewness, and kurtosis). Without this assumption, a significant result could reflect differences in shape or spread rather than location. The equal-shape assumption can be assessed using diagnostics such as the Kolmogorov-Smirnov Test on centered samples (see below), but a non-significant result does not prove equal shape.

121.1.2 Randomization model

In this setting the Wilcoxon Rank-Sum Test is used to test differences between randomized groups in terms of their mean ranks. Unlike the classical model, the mean-rank interpretation does not require the equal-shape assumption: the test validly compares mean ranks regardless of whether the distributions have the same shape.

However, when shapes differ, a significant result only tells us that mean ranks differ; it does not indicate why they differ (location, spread, or shape). For this reason, the equal-shape assumption remains useful for substantive interpretation, even though it is not required for the test’s validity.

In many textbooks and software manuals, the Wilcoxon Rank-Sum Test is said to test for equality of group medians, which is incorrect.

Mean ranks are not the same as medians, which can be easily illustrated with a simple example. Suppose we have two experimental groups (a control and a treatment group). Also assume that the measurements \(x_{1i}\) of the control group are all the same (e.g. \(x_{1i} = 15\)), implying that the control observations receive the same tied mid-rank in the combined ranking. Hence, the mean rank of the control group equals that shared mid-rank (not simply the number of data points \(n\)).

Now assume that the treatment group has a non-zero variance with measurements varying around 15. This implies that not all measurements are equal and that the ranks will not all be equal. As a consequence the mean rank can differ from the control group’s mean rank while it is still possible that the median is 15.

The example illustrates that it is possible that the medians in both groups are equal, while the mean ranks are not. The treatment of ties plays a very important role in this respect.
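This tie effect is easy to check numerically. The sketch below (with small made-up measurements: a constant control group and a treatment group whose median is also 15) shows equal medians but unequal mean ranks:

```r
# Hypothetical data: control is constant, treatment varies around 15
control   <- rep(15, 5)
treatment <- c(14, 15, 15, 16, 17)

median(control)    # 15
median(treatment)  # 15

# Mid-ranks (ties receive the average rank) in the combined sample
r <- rank(c(control, treatment))
mean(r[1:5])   # mean rank of the control group: 5
mean(r[6:10])  # mean rank of the treatment group: 6
```

All seven tied observations equal to 15 share the mid-rank 5, so the control group's mean rank is 5 while the treatment group's mean rank is 6, even though both medians are 15.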

121.2 Analysis based on p-values

Again, consider the analysis that was presented for the ordinary Unpaired Two Sample t-Test. We only need to consider the case of the two-sided Hypothesis Test to illustrate the Wilcoxon Rank-Sum Test (the one-sided tests can be interpreted in similar ways).

The analysis shown below is a copy of the example shown in the previous section. The output shows the results of the Wilcoxon Rank-Sum Test which is produced in the same computation as the ordinary Unpaired Two Sample t-Test.

Interactive Shiny app (click to load).

The p-value of the Wilcoxon Rank-Sum Test is (approximately) 2.24e-11 which is smaller than the chosen type I error level of 5%. We conclude that the Null Hypothesis should be rejected, implying that (depending on whether the classical or randomization model is appropriate) either one of the following statements can be made:

  • both populations have similar distributions but are shifted along the x-axis by a constant amount
  • the mean ranks of both populations are (significantly) different from each other

As stated before, the Wilcoxon Rank-Sum Test does not make distributional assumptions (because it is a non-parametric procedure). What this means is that we do not have to assume any specific continuous distribution (such as the Normal Distribution). This, however, does not imply that there are no assumptions that should be satisfied when using this test (as explained in the classical model).

The main assumption made by the Wilcoxon Rank-Sum Test is that both populations have “similar” distributions in the sense that they have the same shape. The so-called Kolmogorov-Smirnov Test (KS Test) can be used as a diagnostic to test whether two samples are drawn from the same continuous distribution (i.e. the Null Hypothesis) or not. When the p-value of the KS Test is small, we reject the Null Hypothesis and conclude that the distributions are not equal.

The output contains two KS Tests: one that compares the distributions as a whole and one that compares only their shapes.

121.2.1 KS Test for distributions

First it is tested whether the distributions are equal (H\(_0\)). The p-value is 7.634e-10 which leads us to conclude that both distributions are different (H\(_0\) is rejected). This computation, however, is not really appropriate to test the assumption that underlies the Wilcoxon Rank-Sum Test because two distributions can be different due to inequality of the location or inequality of the shape (or both). Since the Wilcoxon Rank-Sum Test revealed that the location of both distributions is different, the KS Test should not simply compare the distributions as a whole but only take into account the shape.

121.2.2 KS Test for distributional shapes

We can explore whether the shapes of the distributions are equal by applying the KS Test to the centered samples (i.e. by subtracting the sample means from the observations). The second KS Test is useful as a shape-focused diagnostic because centering removes the location difference from the comparison.

The results show that H\(_0\) should not be rejected: the p-value is (approximately) 0.7672. Hence, we do not detect a shape difference in this sample; this is consistent with (but does not prove) the equal-shape assumption of the Wilcoxon Rank-Sum Test.

For two groups of sizes \(n_1\) and \(n_2\), let \(R_1\) be the sum of ranks for group 1. The Mann-Whitney statistic is

\[ U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \quad U_2 = n_1 n_2 - U_1, \quad U = \min(U_1,U_2). \]

When ties are present, the p-value is computed with a normal approximation that includes a tie correction; for small samples without ties, exact methods are available. In R, wilcox.test chooses between these automatically.

For reporting, add an effect size such as rank-biserial correlation:

\[ r_{rb} = 1 - \frac{2U}{n_1 n_2} \]
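As a sketch (with small made-up samples), the statistic and effect size above can be computed directly from the ranks and checked against wilcox.test, whose reported W equals \(U_1\) for the first sample:

```r
A <- c(5, 7, 8, 9, 10)  # hypothetical sample 1
B <- c(1, 2, 3, 4, 6)   # hypothetical sample 2
n1 <- length(A); n2 <- length(B)

R1 <- sum(rank(c(A, B))[seq_len(n1)])  # rank sum of sample 1: 39
U1 <- R1 - n1 * (n1 + 1) / 2           # 24
U2 <- n1 * n2 - U1                     # 1
U  <- min(U1, U2)                      # 1

r_rb <- 1 - 2 * U / (n1 * n2)          # rank-biserial correlation: 0.92

wilcox.test(A, B)$statistic            # W = 24, i.e. U1
```

Note that the sign/direction of the rank-biserial correlation depends on which group is taken as the first sample; the formula with \(U = \min(U_1, U_2)\) yields its magnitude.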

To compute the Mann-Whitney U test (Wilcoxon Rank-Sum Test) on your local machine, the following script can be used in the R console:

set.seed(123)
A <- runif(15, 1, 7) + 2
B <- runif(15, 1, 7)
x <- cbind(A, B)
par1 <- 1            #column number of first sample
par2 <- 2            #column number of second sample
par3 <- 0.95         #confidence level (= 1 - alpha)
par4 <- 'two.sided'  #alternative hypothesis
par5 <- 'unpaired'   #'paired' or 'unpaired'
par6 <- 0.0          #location shift under the Null Hypothesis (mu)
main <- 'Two Samples'
paired <- (par5 != 'unpaired')
(wilcox.test(x[,par1], x[,par2], alternative=par4, paired=paired, mu=par6, conf.level=par3))
(ks.test(x[,par1], x[,par2], alternative=par4))
#shift the second sample so that both means coincide (shape-only comparison)
m1 <- mean(x[,par1], na.rm=TRUE)
m2 <- mean(x[,par2], na.rm=TRUE)
mdiff <- m1 - m2
newsam1 <- x[!is.na(x[,par1]), par1]
newsam2 <- x[,par2] + mdiff
newsam2 <- newsam2[!is.na(newsam2)]
(ks.test(newsam1, newsam2, alternative=par4))

    Wilcoxon rank sum exact test

data:  x[, par1] and x[, par2]
W = 175, p-value = 0.008642
alternative hypothesis: true location shift is not equal to 0


    Exact two-sample Kolmogorov-Smirnov test

data:  x[, par1] and x[, par2]
D = 0.53333, p-value = 0.02625
alternative hypothesis: two-sided


    Exact two-sample Kolmogorov-Smirnov test

data:  newsam1 and newsam2
D = 0.2, p-value = 0.9383
alternative hypothesis: two-sided

Note that the above script assumes that the data is in a wide format. The wilcox.test function can, of course, also be used with the formula syntax when the dataset is in a long format.
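For example, an equivalent long-format version of the same test (reusing the seed from the script above, so the generated data are identical) could look like this:

```r
set.seed(123)
df <- data.frame(
  value = c(runif(15, 1, 7) + 2, runif(15, 1, 7)),
  group = factor(rep(c("A", "B"), each = 15))
)
wilcox.test(value ~ group, data = df, alternative = "two.sided",
            conf.level = 0.95)
# reproduces the result of the wide-format call above
```

The formula syntax is convenient when the grouping variable is stored as a column, as is typical for data imported from long-format files.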

121.3 Assumptions

As mentioned in the previous section, we assume that the distributions of both populations are similar in terms of shape.

121.4 Alternatives

The alternatives to this test are explained in Section 118.4.

Mann, Henry B., and Donald R. Whitney. 1947. “On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other.” The Annals of Mathematical Statistics 18 (1): 50–60. https://doi.org/10.1214/aoms/1177730491.
Wilcoxon, Frank. 1945. “Individual Comparisons by Ranking Methods.” Biometrics Bulletin 1 (6): 80–83. https://doi.org/10.2307/3001968.

© 2026 Patrick Wessa. Provided as-is, without warranty.
