Use the Computation tab and repeat the One Sample t-Test with \(\mu_0 = 49.5\). Compare the results from the two-sided and one-sided tests. Try to explain why the results seem to contradict each other.
When we recompute the analysis with \(\mu_0 = 49.5\) instead of 50, the results show that the Null value is contained in the two-sided confidence interval around the mean (\(\mu_0 = 49.5 \in [49.0, 58.4]\)). Therefore we fail to reject the Null Hypothesis H\(_0: \mu = \mu_0 = 49.5\).
The Null value is, however, not contained in the left-sided confidence interval (\(\mu_0 = 49.5 \notin [49.9, +\infty)\)). Hence, we reject the Null Hypothesis and accept the Alternative H\(_A: \mu > \mu_0 = 49.5\).
The two results seem to contradict each other because \(\mu > \mu_0 \Rightarrow \mu \neq \mu_0\). How could things have gone wrong?
The answer is simple if we consider the interpretation of the underlying hypotheses that are tested. In the first case, we test a two-sided hypothesis because we don’t know whether the mean will be larger or smaller than 49.5. In other words, there is no a priori knowledge about the deviation of the mean (from the Null value of 49.5). In the second case, however, we implicitly assume (with certainty) that \(\mu\) cannot be smaller than \(\mu_0 = 49.5\). In this case, there is deterministically defined prior knowledge about the deviation of the mean from \(\mu_0\) (i.e. it can only deviate in one direction). Hence, the formulation of a one-sided Hypothesis Test is always accompanied by an implicit assumption about the result that is going to be observed.
The computations in this Task illustrate that prior knowledge is not neutral because it has an impact on the critical values (and confidence intervals) that are used in testing hypotheses. Hence, if we conclude (as is done in the second case) that \(\mu > 49.5\) then we have to remind ourselves of the fact that this statement depends on the deterministic assumption about the relationship between \(\mu\) and \(\mu_0\).
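The apparent contradiction is easy to reproduce in R. The sketch below uses simulated data (the original sample is not reproduced here, so the numbers are hypothetical): whenever the sample mean lies above the Null value, the one-sided p-value is exactly half the two-sided p-value, so a borderline two-sided test may fail to reject while the one-sided test rejects.

```r
# Hypothetical data; chosen so the sample mean lies above the Null value 49.5.
set.seed(1)
x <- rnorm(n = 25, mean = 53, sd = 9)

# Two-sided test: no prior knowledge about the direction of the deviation.
two.sided <- t.test(x, mu = 49.5, alternative = "two.sided")

# One-sided test: implicitly assumes mu cannot be smaller than 49.5.
one.sided <- t.test(x, mu = 49.5, alternative = "greater")

# Since the sample mean exceeds 49.5, the one-sided p-value is exactly
# half the two-sided p-value -- the one-sided test rejects more easily.
two.sided$p.value
one.sided$p.value
```

This halving of the p-value is the numerical counterpart of the prior knowledge discussed above: the one-sided test "spends" the entire type I error budget in one tail.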
The p-value of the D’Agostino Skewness Test is (approximately) 0.5817, which is extremely large. Hence we do not reject the Null Hypothesis which states that the skewness of the data is zero (i.e. the data are symmetrically distributed).
The p-value of the Anscombe-Glynn Kurtosis Test is (approximately) 0.03698 which, assuming a chosen type I error level of 1%, is larger than \(\alpha\) and would therefore lead us to fail to reject the Null Hypothesis.
Using \(\alpha=0.01\), however, does not reflect a very rigorous, scientific attitude. The reason is that the Null Hypothesis corresponds to what we wish to prove (i.e. this is a so-called “diagnostic” test which tests whether the distribution deviates from normality). Since the default position (i.e. the Null Hypothesis) is related to what we wish to prove, we should not be primarily concerned with the type I error (of making a false “positive”) -- on the contrary, we should be interested in the type II error (of drawing the conclusion that our data are normally distributed while, in fact, they are not). As it is not possible to compute the type II error in this setting, we need to be careful about the choice of the type I error level. The conservative, scientific thing to do is to increase the chosen type I error level in order to reduce the type II error. Raising \(\alpha\) from 1% to, say, 10% could (theoretically) result in a drastic reduction of the type II error \(\beta\) because the relationship between \(\alpha\) and \(\beta\) is typically non-linear.
The conclusion is that the p-value of (approximately) 3.7% is too low to be considered safe. Hence, we reject the Null Hypothesis and accept the Alternative which states that the Kurtosis of the data is different from 3 (i.e. the Kurtosis of the Normal Distribution).
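The statistics behind these two tests can be illustrated in base R. The sketch below (with simulated, hypothetical data -- the original sample is not reproduced) computes the moment-based sample skewness and kurtosis that the D’Agostino and Anscombe-Glynn tests are built around; the tests themselves are available in contributed packages such as moments.

```r
# Hypothetical data; for normal data skewness is near 0 and kurtosis near 3.
set.seed(123)
x <- rnorm(100)

# Moment-based sample skewness: third central moment over sd^3.
skewness <- function(v) {
  m <- mean(v); s <- sqrt(mean((v - m)^2))
  mean((v - m)^3) / s^3
}

# Moment-based sample kurtosis: fourth central moment over sd^4.
kurtosis <- function(v) {
  m <- mean(v); s <- sqrt(mean((v - m)^2))
  mean((v - m)^4) / s^4
}

skewness(x)  # close to 0 for symmetric data
kurtosis(x)  # close to 3 for normally distributed data
</imports>
```

The Null values of the two tests (skewness of 0, kurtosis of 3) are exactly the values these functions return, on average, for normally distributed data.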
What type of Hypothesis Test was performed? From the output it can be deduced that the computation involves the “Paired Two Sample t-Test” with a chosen type I error of 3%.
Write down the Null and Alternative Hypothesis.
The Null and Alternative Hypotheses can be formulated as follows: \[\begin{cases} \text{H}_0: \mu_1 - \mu_2 = \mu_0 = 2 \\ \text{H}_A: \mu_1 - \mu_2 \neq \mu_0 = 2 \end{cases}\]
which is a two-sided hypothesis.
Interpret the confidence interval and the p-value. Do we have to reject the Null Hypothesis?
We need to interpret the two-sided confidence interval. The Null value is contained in the two-sided interval (\(\mu_0 = 2 \in [1.6, 2.1]\)) which implies that we fail to reject the Null Hypothesis.
What is the lowest type I error we could choose which would lead us to reject H\(_0\)?
If we choose a type I error \(\alpha = 0.2637 + \delta\) (where \(\delta\) is an arbitrarily small number) then we are allowed to reject the Null Hypothesis.
For the previous task, examine the results from the Wilcoxon Signed-Rank Test. What are your conclusions?
The p-value is (approximately) 0.052 which is low but still higher than the chosen type I error level of 3%. Hence, we fail to reject the Null Hypothesis which states that the differences between pairs are symmetrically distributed around \(\mu_0 = 2\).
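A test of this kind can be set up in R with wilcox.test, using the mu argument to shift the Null value to 2 and a confidence level matching the 3% type I error. The paired data below are simulated and purely hypothetical (the original sample is not reproduced here); the point is the call signature.

```r
# Hypothetical paired observations; the Null value for the pairwise
# differences is mu = 2, and conf.level = 0.97 matches alpha = 3%.
set.seed(123)
before <- rnorm(30, mean = 10, sd = 2)
after  <- before + rnorm(30, mean = 2.5, sd = 1)

res <- wilcox.test(after, before, paired = TRUE, mu = 2,
                   conf.int = TRUE, conf.level = 0.97)
res
```

The reported confidence interval is for the (pseudo)median of the pairwise differences, which is what the Null value of 2 refers to.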
Examine the analysis presented in the Computation tab and explain the conclusions.
Note: you have to make sure that “violin plot (within variance)” is selected to see the proper analysis. Be patient, it may take a while to complete the computation.
The computation represents an alternative to the classical approach because it combines Hypothesis Testing with Explorative Data Analysis (based on box and violin plots). Both the parametric and non-parametric tests reject the Null Hypothesis of a zero difference between both variables.
The analysis compares the mean ranks of IM.Know and IM.Accomplishment from the AMS dataset (see also Tasks 3 and 4). The only difference is the Null value: here we have a zero Null Hypothesis whereas in Tasks 3 and 4 we employed a value of 2.
Write an R script that performs a two-sided test of the Arithmetic Mean for two variables of unequal length: \(X \sim \text{N}(3.7, 2.1)\) and \(Y \sim \text{U}(12, 20)\). Use a significance level of 1%.
set.seed(123)
X = rnorm(n = 150, mean = 3.7, sd = 2.1)
Y = runif(n = 200, min = 12, max = 20)
t.test(X, Y, var.equal = F, paired = F, conf.level = 0.99)
Welch Two Sample t-test
data: X and Y
t = -52.988, df = 341.72, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-12.86089 -11.66210
sample estimates:
mean of x mean of y
3.648838 15.910332
It is implied in the question that we have to use an Unpaired Two Sample t-Test because the samples have an unequal length. We use the Welch test (i.e. var.equal = F) because this will allow for unequal variances.
Write an R script which performs a Wilcoxon test for paired samples. The type I error is 2%.
set.seed(123)
X = rnorm(n = 150, mean = 3, sd = 1)
Y = rnorm(n = 150, mean = 4, sd = 1) # make sure both samples have equal length!
wilcox.test(X, Y, paired = T, conf.int = T, conf.level = 0.98)
Wilcoxon signed rank test with continuity correction
data: X and Y
V = 1584, p-value = 1.99e-14
alternative hypothesis: true location shift is not equal to 0
98 percent confidence interval:
-1.4049462 -0.8354664
sample estimates:
(pseudo)median
-1.120432
Suppose we have two paired samples and we wish to test the mean difference of both groups. We use two different methods to do this:
First we compute the differences of each pair such that a new data set is obtained. The new data set is analyzed with the Bootstrap Plot for Central Tendency.
The second method is the Bayesian Two Sample Test which is applied to the paired samples.
Both methods use simulation algorithms to obtain their results. Without going into the details of these algorithms (they are not relevant in this question), explain in your own words the conceptual difference between both approaches. Do you think that the conclusion of both methods will be different? Why?
In principle, both methods could provide us with valid results. There are, however, fundamental differences between both approaches which have important consequences.
The first approach is of a “frequentist” nature because the bootstrap method treats the sample as if it were a population and it draws (repeatedly) random samples with replacement. The resulting samples are used to compute the frequency distributions that are associated with the measures of Central Tendency of interest. This allows us to obtain confidence intervals which can be used for Hypothesis Testing. The rationale of this approach is that one relies exclusively on the observed sample data to test the hypothesis. There is no prior knowledge involved.
The bootstrap method is sometimes believed to be “objective” because it is only based on actual data -- the researcher’s beliefs have no impact on the result. This is, however, not always true because it is possible that the original sample (which is used to bootstrap new samples) is not representative of the population. In this sense, the bootstrap method makes implicit assumptions about the quality of the sample.
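The first (frequentist) approach can be sketched in a few lines of base R. The pairwise differences below are simulated and purely hypothetical; the sketch resamples them with replacement and reads off a percentile confidence interval for the mean.

```r
# Hypothetical pairwise differences (the real data set is not reproduced here).
set.seed(123)
d <- rnorm(50, mean = 2, sd = 1.5)

# Draw B bootstrap samples (with replacement) and record the mean of each.
B <- 5000
boot.means <- replicate(B, mean(sample(d, replace = TRUE)))

# 95% percentile confidence interval for the mean difference.
ci <- quantile(boot.means, c(0.025, 0.975))
ci
```

If the Null value (e.g. 0) lies outside this interval, the corresponding Hypothesis Test rejects -- note that only the observed sample enters the computation, with no prior knowledge involved.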
The second approach is a “Bayesian” method which combines prior information (either based on actual data from previous studies or on expert knowledge) with the actual data that is observed. At the heart of this approach is Bayes’ Theorem which explains how the posterior distribution can be computed when the prior distribution and the data-based likelihood have been obtained.
The obvious advantage of the Bayesian method is that the sample does not need to be representative. In addition, the researcher is able to reconcile qualitative information (from the expert) with quantitative information (from the sample). On the other hand, the Bayesian method may also fail miserably if the expert knowledge is somehow prejudiced or biased.
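To make the contrast concrete, here is a toy conjugate Normal-Normal update for the mean of the pairwise differences, with the variance treated as known. All numbers are hypothetical; the purpose is only to show how Bayes’ Theorem blends the prior with the data-based likelihood.

```r
# Prior belief about the mean difference (weak: large prior variance).
prior.mean <- 0;  prior.var <- 4

# Summary of the (hypothetical) observed differences.
data.mean  <- 2;  data.var <- 1.5^2;  n <- 50

# Conjugate Normal-Normal update: precisions (inverse variances) add,
# and the posterior mean is a precision-weighted average.
post.var  <- 1 / (1 / prior.var + n / data.var)
post.mean <- post.var * (prior.mean / prior.var + n * data.mean / data.var)

c(posterior.mean = post.mean, posterior.var = post.var)
```

With 50 observations the posterior mean sits very close to the sample mean: the data dominate a weak prior, which is exactly the reconciliation of prior and sample information described above.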
The bottom line is that it is up to you to decide which approach is best for your research!
Reconsider the example from Section 61.6 (click on the Computation tab to see the analysis) and suppose you wish to test differences between the variables with other types of Hypothesis Tests. Which tests would you use? What are the pros and cons?
The data is structured in wide format implying that we have to investigate the within variance of paired samples. Theoretically speaking, we could have arranged unpaired samples in a wide format but then we would see missing observations in adjacent columns (when you click on the Input tab you can see that there are no missing values for the four selected variables).
When we select the “violin plot (within variance)” option, however, the output displays an uninformative error message. The reason for this is that the software has been limited to comparing only two variables in this case (we did this on purpose to make sure that the computation doesn’t take too long). So we also have to reduce the number of variables to two to make this work.
In addition, we should select “nonparametric” in the “Type of ggstatsplot” drop-down menu because we are dealing with data that has been coded on a 7-point Likert scale (the Wilcoxon test uses ranks, so it is permitted to use ordinal data).
The dataset mtcars is available by default. The variables are best treated as categorical because they have a discrete distribution. In other words, it makes sense to use the Chi-squared test for this problem. Note that we have to use simulate.p.value = T to derive simulated p-values because the expected cell frequencies are rather low.
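A sketch of such an analysis is shown below. The choice of cyl and gear is an assumption for illustration (the task does not fix the variables); both are discrete mtcars variables, so a contingency table with fairly low expected cell frequencies results, and simulate.p.value = T is appropriate.

```r
# Cross-tabulate two discrete mtcars variables (cyl and gear are assumed
# here for illustration) and test for independence.
tab <- table(mtcars$cyl, mtcars$gear)

# Expected cell frequencies are low, so use simulated p-values
# (Monte Carlo) instead of the asymptotic Chi-squared distribution.
set.seed(123)
res <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
res
```

R would otherwise warn that the Chi-squared approximation may be incorrect; the simulated p-value sidesteps that asymptotic approximation.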
Use the Curry dataset that can be downloaded from https://bookmark.wessa.net/s/rfc-apps/curry.csv to examine the effect of the Curry variable on the Rate variable (= response). Use an R script to complete this task.