111 Hypothesis Testing for Research Purposes

The following sections describe the practical use of several types of hypothesis test. Each test is first illustrated at the statistical level; its contextual interpretation follows in the case-study chapters.

When reporting hypothesis tests in university-level work, three components should be presented together:

Statistical significance: report the p-value and state whether \(H_0\) is rejected at the pre-specified \(\alpha\).
Effect size: report magnitude (not only significance), e.g. Cohen’s \(d\) (Cohen 2013), the rank-biserial correlation, \(\eta^2\), or Cramér’s \(V\) (Cramér 1946), depending on the test.
Uncertainty and precision: report confidence/credible intervals and discuss practical relevance.

Power also matters. A non-significant result can mean either “no meaningful effect” or “insufficient sample size.” Therefore, interpretation should always acknowledge sample size, variability, and design quality.

In short: do not rely on p-values alone. Use p-values, effect sizes, and intervals jointly.
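As a minimal sketch of joint reporting, the function below returns the p-value, Cohen’s \(d\), and a confidence interval for a two-group mean comparison. The function name `report_two_sample` and the data are hypothetical. The sketch also assumes a large-sample normal approximation so that it runs on the Python standard library alone; for small samples, a t-based routine from a statistics package would normally replace the normal quantiles.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def report_two_sample(x, y, alpha=0.05):
    """Summarise a two-group mean comparison: p-value, Cohen's d, and CI.

    Assumption of this sketch: a normal (z) approximation is used instead
    of the t distribution, which is only reasonable for larger samples.
    """
    nx, ny = len(x), len(y)
    diff = mean(x) - mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)

    # Standard error of the difference, without assuming equal variances
    se = sqrt(vx / nx + vy / ny)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

    # Cohen's d using the pooled standard deviation
    pooled_sd = sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    cohens_d = diff / pooled_sd

    # Two-sided (1 - alpha) confidence interval for the mean difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "diff": diff,
        "p_value": p_value,
        "reject_h0": p_value < alpha,
        "cohens_d": cohens_d,
        "ci": ci,
    }


# Hypothetical measurements for two independent groups
treatment = [23.1, 25.4, 24.8, 26.0, 22.9, 25.7, 24.3, 26.5, 25.1, 24.6]
control = [22.0, 23.5, 21.8, 24.1, 22.7, 23.0, 21.5, 23.8, 22.4, 23.2]

result = report_two_sample(treatment, control)
print(result)
```

Returning all three quantities from one function makes it harder to report the p-value in isolation, which is the habit this chapter argues against.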

Decision Threshold Choice

This chapter explains what should be reported together (p-value, effect size, interval, and power/sample-size context). The next step is to choose the threshold (e.g. \(\alpha\) or the confidence level) according to the role of the analysis:

confirmatory,
diagnostic,
exploratory/selection,
equivalence.

The handbook-wide framework for making and reporting that threshold choice is developed in Chapter 112.

111.0.1 Typical Effect Sizes by Test Family

One-sample / paired / unpaired mean tests: Cohen’s \(d\) (Cohen 2013) (or Hedges’ \(g\) (Hedges 1981))
ANOVA: \(\eta^2\) or partial \(\eta^2\)
Chi-squared tests: Cramér’s \(V\) (Cramér 1946) (or \(\phi\) for \(2\times2\) tables)
Rank-based tests: rank-biserial correlation or Cliff’s delta (Cliff 1993)
Correlation tests: \(r\), \(\rho\), or \(\tau\) (the correlation itself is the effect size)
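Two of the effect sizes listed above can be computed directly from their textbook definitions. The sketch below is illustrative only: the helper names `cramers_v` and `eta_squared` are hypothetical, the inputs are plain Python lists, and no continuity or bias correction is applied.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic from observed and expected counts
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    return sqrt(chi2 / (n * (k - 1)))


def eta_squared(groups):
    """Eta squared for one-way ANOVA: SS_between / SS_total."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Hypothetical data: a 2 x 3 table of counts and three small groups
print(cramers_v([[20, 15, 5], [10, 15, 25]]))
print(eta_squared([[4.1, 3.9, 4.5], [5.0, 5.2, 4.8], [6.1, 5.9, 6.3]]))
```

Both measures range from 0 (no association, or no between-group variance) to 1 (perfect association, or all variance between groups). For a \(2\times2\) table, `cramers_v` returns the absolute value of \(\phi\).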

111.0.2 Power and Sample Size

Before data collection, power analysis should align three inputs: target effect size, acceptable type I error \(\alpha\), and desired power (\(1-\beta\)).
After analysis, strong “no effect” conclusions should be avoided in low-powered studies when p-values are non-significant.
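The relationship between the three inputs can be sketched for the simplest case: a two-sided comparison of two independent group means with equal group sizes. The code assumes the common normal-approximation formula \(n = 2\left((z_{1-\alpha/2} + z_{1-\beta})/d\right)^2\) per group, where \(d\) is the target standardized effect size. The function names are hypothetical, and exact t-based calculations in dedicated power software give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test.

    Assumption of this sketch: normal approximation, equal group sizes.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power for effect size d with n observations per group.

    The negligible probability of rejecting in the wrong direction is ignored.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n / 2) - z_alpha)


# Planning: medium effect (d = 0.5), alpha = 0.05, target power = 0.80
print(n_per_group(0.5))  # → 63

# After analysis: power actually available with only 20 per group
print(approx_power(0.5, 20))
```

The second call illustrates the caution above: with 20 observations per group, the approximate power to detect \(d = 0.5\) is well below 0.80, so a non-significant result from such a study says little about whether an effect of that size exists.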