
112  Decision Thresholds, Alpha, and Confidence Levels

112.1 Purpose of This Chapter

This chapter explains a general principle used throughout this handbook:

statistical thresholds must be chosen by decision context.

You have already seen this idea in earlier chapters, in settings that did not use p-values: in classification and ROC analysis, we choose thresholds by weighing false positives, false negatives, sensitivity, specificity, and decision costs. Hypothesis testing uses the same logic. The difference is that the threshold is usually written as a significance level \(\alpha\) (or, equivalently, as \(1 - \alpha\), the confidence level).

This chapter makes that connection explicit and provides a consistent framework for choosing and reporting thresholds across:

  • diagnostic checks,
  • confirmatory hypothesis tests,
  • exploratory/selection workflows,
  • equivalence testing,
  • and confidence-interval interpretation.

It also provides a practical reporting template (see Reporting Template) and examples that show why the same threshold should not be used automatically in every situation.

112.2 Why the One-Alpha-for-All Convention Is Wrong

The one-alpha-for-all convention (usually 5%) is wrong as a general methodological rule.

It is wrong because it treats \(\alpha\) as a ritual constant instead of a decision threshold. A threshold that is appropriate for one purpose may be inappropriate for another. In particular, a threshold for a confirmatory research claim should not automatically be reused for a diagnostic assumption check, and neither should automatically be reused for an equivalence decision.

A fixed alpha convention becomes especially misleading when it is applied mechanically and interpreted without regard to:

  • the role of the test,
  • the meaning of \(H_0\) in context,
  • the relative costs of Type I and Type II errors,
  • effect size and practical significance,
  • and power / sample size.

This chapter does not reject the use of \(\alpha\). It rejects the idea that a single alpha value can serve as a universal methodological principle.1

112.3 Connecting to Earlier Threshold Concepts

In earlier chapters, thresholds already appeared in practical decision contexts:

  • classification thresholds (strict vs permissive),
  • sensitivity/specificity trade-offs,
  • false alarms vs missed detections,
  • pay-off / cost matrices,
  • ROC-based threshold selection.

See, for example:

  • Sensitivity and Specificity (Chapter 8),
  • Confusion Matrix (Chapter 59),
  • ROC Analysis (Chapter 60),
  • Naive Bayes Classifier (Chapter 9).

Hypothesis testing uses the same underlying logic:

  • a classification threshold decides whether a score is labeled positive,
  • a significance threshold \(\alpha\) decides whether evidence is strong enough to reject \(H_0\).

In both cases, changing the threshold changes error rates.

112.4 Alpha as a Decision Threshold

In hypothesis testing, \(\alpha\) controls the probability of a Type I error under the model assumptions. But this does not mean the same \(\alpha\) is appropriate for every task.

The wrong question is:

  • “What is the standard alpha?”

The right question is:

  • “What decision is this test supporting, and what error is more costly here?”

112.5 Bayesian Decision Framing in This Handbook

The same threshold logic can be expressed directly in Bayesian form:

  • define a practical claim (for example, \(p < p_0\) or \(p > p_0\)),
  • compute \(P(\text{claim}\mid\text{data})\),
  • choose a posterior decision threshold by context,
  • and report posterior decision error probability.

This often feels more natural in applied decision settings because the threshold is attached to an explicit probability of the claim rather than only to long-run Type I error control.
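
As a concrete illustration, the posterior probability of such a claim has a closed form for binomial data with a conjugate Beta prior. This is a minimal sketch: the uniform Beta(1, 1) prior, the counts (12 successes in 50 trials), the claim \(p < 0.5\), and the posterior threshold 0.95 are all illustrative assumptions, not handbook defaults.

    # Posterior probability of the practical claim p < p0 for binomial data
    # under a Beta(1, 1) prior (all numbers are illustrative).
    successes <- 12; trials <- 50; p0 <- 0.5
    post <- pbeta(p0, shape1 = 1 + successes,
                      shape2 = 1 + trials - successes)
    post            # P(claim | data)
    post >= 0.95    # decision at a context-chosen posterior threshold
    1 - post        # posterior decision error probability if the claim is accepted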

See Chapter 113 for the full workflow with:

  • posterior-threshold decisions,
  • Bayes factors,
  • and threshold-dependent Bayes decision error interpretation.

112.6 Roles of Tests and Diagnostics

We use four roles in this handbook:

  1. Confirmatory
  2. Diagnostic
  3. Exploratory / Selection
  4. Equivalence

112.6.1 Confirmatory

Confirmatory tests are tied to the main research claim.

Examples include:

  • comparing means between groups,
  • testing treatment effects,
  • testing whether a parameter differs from a benchmark/null value.

In confirmatory settings, false positives are often costly (incorrect claims, poor decisions, publication bias), so stricter thresholds are often appropriate.

Typical range (context-dependent): \(\alpha = 0.01\) to \(0.05\)

112.6.2 Diagnostic

Diagnostic tests are used to check assumptions or screen for potential problems.

Examples include:

  • equal variance checks,
  • distributional shape checks,
  • normality diagnostics,
  • residual assumption checks,
  • autocorrelation diagnostics.

In many diagnostic settings, \(H_0\) corresponds to the “good” state we hope is approximately true (e.g. equal variances, no serious shape difference). A low-powered diagnostic combined with a low alpha can create false reassurance (Type II error).

For this reason, a higher diagnostic alpha is often reasonable.

Typical range (context-dependent): \(\alpha = 0.10\) to \(0.20\)

This does not prove assumptions. It reduces the risk of missing relevant violations.

112.6.3 Exploratory / Selection

In exploratory/selection settings, the goal is not primarily to make a single confirmatory claim. Instead, the goal is to screen, rank, or choose among alternatives.

Examples include:

  • selecting variables/features,
  • selecting a model among candidates,
  • choosing a transformation,
  • selecting an asset for a specific distributional property,
  • screening data for candidates that deserve follow-up analysis.

This role is broader than hypothesis testing alone. In exploratory/selection workflows, thresholds may be applied to quantities such as:

  • p-values,
  • information criteria (e.g. AIC/BIC),
  • cross-validated error,
  • AUC or classification performance,
  • posterior probabilities / Bayes factors,
  • effect-size cutoffs,
  • graphical diagnostic bands (e.g. QQ or bootstrap intervals).

The central principle remains the same:

the threshold must be chosen by the selection objective and the cost of selection errors.

For example:

  • if the purpose is broad screening (do not miss promising candidates), a more permissive threshold may be appropriate;
  • if the purpose is final selection or a strong claim about the selected candidate, a stricter threshold may be appropriate.

This is one reason exploratory/selection workflows often benefit from a two-stage design:

  1. a more permissive screening stage (high sensitivity), followed by
  2. a stricter confirmation stage (higher evidential standard).

When p-values are used in exploratory/selection settings, they should not be treated as if they were automatically confirmatory. If a selected result is later promoted to a substantive claim, that claim should be evaluated in a confirmatory framework with its own pre-specified threshold and interpretation.

Two cautions are especially important in exploratory/selection work:

  • selection can amplify chance findings (especially when many candidates are screened),
  • thresholds suitable for screening are often too permissive for final claims.

For this reason, exploratory/selection reporting should clearly state:

  • the selection objective,
  • the threshold(s) used,
  • whether the threshold was for screening or final selection,
  • whether confirmatory follow-up is still required.

112.6.4 Equivalence

Equivalence testing (e.g. TOST, Two One-Sided Tests, introduced in Chapter 120) reverses the usual burden of proof.

Here, non-rejection of a difference test is not enough. The goal is to show that the effect is small enough to fall within a pre-specified practical margin.

This is a different decision problem and should be treated as such.

112.7 Confidence Levels and Alpha

A confidence level is another way of expressing a threshold.

For the usual two-sided setting:

  • confidence level \(= 1 - \alpha\)

Examples:

  • \(\alpha = 0.05 \leftrightarrow 95\%\) confidence interval
  • \(\alpha = 0.10 \leftrightarrow 90\%\) confidence interval
  • \(\alpha = 0.20 \leftrightarrow 80\%\) confidence interval

So changing the confidence level is also changing the decision threshold.

This matters in practice:

  • for some diagnostic uses (screening for violations), a lower confidence band (higher alpha) may be more appropriate,
  • for confirmatory claims, a higher confidence level (lower alpha) may be more appropriate,
  • for equivalence testing, the mapping depends on the TOST setup (see the equivalence example below).

This is why the same tool (for example, a QQ plot with confidence bands or a bootstrap interval) may legitimately be used with different confidence levels for different purposes.
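
In R, this correspondence is visible directly in interval functions such as stats::t.test(), whose conf.level argument is exactly \(1 - \alpha\). The simulated sample below is illustrative:

    # One simulated sample, three confidence levels: the data and the
    # estimate stay the same; only the threshold (interval width) changes.
    set.seed(7)
    x <- rnorm(50, mean = 0.3, sd = 1)
    for (cl in c(0.80, 0.90, 0.95)) {
      ci <- t.test(x, conf.level = cl)$conf.int
      cat(sprintf("%.0f%% CI: [%.3f, %.3f]\n", 100 * cl, ci[1], ci[2]))
    }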

112.8 Pre-Specification of Thresholds

A threshold (for example, \(\alpha\), a confidence level, or a diagnostic cutoff) should be chosen based on the role of the analysis and the decision purpose before interpreting the result.

This is a scientific requirement, not a stylistic preference.

In particular:

  • it is not valid to inspect the result first and then choose the threshold because it produces a preferred conclusion;
  • it is not valid to justify a threshold only by saying that it is “standard”;
  • and it is not valid to treat the same threshold as automatically appropriate for confirmatory, diagnostic, exploratory, and equivalence settings.

What is valid is to pre-specify either:

  1. a single threshold (e.g. confirmatory \(\alpha = 0.05\)), or
  2. a pre-declared threshold set / range (e.g. report decisions at 1%, 5%, 10%, and 20%) when the goal is to show sensitivity to the threshold.

The key requirement is that the choice (or range) is justified by the decision context, not by the observed result.

If a threshold is changed after seeing the data, that may still be useful as an exploratory sensitivity analysis, but it should be reported as such and should not be presented as a pre-specified confirmatory decision.

This same guardrail applies to:

  • significance levels (\(\alpha\)),
  • confidence levels (\(1-\alpha\)),
  • equivalence margins and TOST thresholds,
  • classifier thresholds,
  • and diagnostic cutoffs in plots or screening tools.

112.9 Non-Rejection Does Not Prove the Null

Failing to reject \(H_0\) does not prove \(H_0\).

This matters most in diagnostic testing, where \(H_0\) often represents an assumption we would like to hold (approximately).

A non-significant diagnostic result may mean:

  • the assumption is reasonably compatible with the data, or
  • the test had insufficient power to detect a relevant violation.

This is one reason diagnostic tests often use a higher alpha than confirmatory tests: the goal is to reduce false reassurance, not to “prove” assumptions.

Even with a higher diagnostic alpha, non-rejection still does not prove the null. It only means that no strong alarm was triggered at the chosen threshold.
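
A small simulation makes the false-reassurance mechanism visible. Here the true distribution is genuinely non-normal (Student t with 5 degrees of freedom), so \(H_0\) is false in both runs; only the sample size, and therefore the power, differs. The use of shapiro.test() and the specific seed are illustrative choices:

    # Same true violation of normality, different power.
    set.seed(42)
    shapiro.test(rt(20,   df = 5))$p.value  # small n: often > 0.10 (no alarm)
    shapiro.test(rt(2000, df = 5))$p.value  # large n: typically very small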

112.10 Effect Size, Practical Significance, and Power

Threshold choice (for example, choosing \(\alpha\) or a confidence level) is important, but it is never sufficient on its own.

A statistically sound interpretation requires three things together:

  1. a threshold decision (alpha / confidence level / cutoff),
  2. an effect estimate and its practical meaning,
  3. and an assessment of how informative the data are (power / sample size).

112.10.1 Effect Size and Practical Significance

A p-value answers a narrow question about compatibility with \(H_0\) under the model assumptions. It does not answer:

  • whether the effect is large enough to matter in practice,
  • whether the difference is meaningful for the decision,
  • or whether the observed magnitude is useful in the real application.

With large enough samples, very small and practically unimportant effects can become statistically significant. Conversely, with small samples, practically important effects may fail to reach a chosen threshold.

For this reason, threshold-based decisions should be interpreted together with:

  • an effect size estimate (magnitude),
  • a practical significance statement (what magnitude matters here),
  • and an interval estimate (uncertainty/precision).

The practical threshold for a meaningful effect is a substantive decision, not a by-product of the p-value.
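
The following simulation sketch shows the large-sample side of this effect; the assumed practical margin of 0.5 units is a hypothetical substantive choice:

    # With n = 5000 per group, a true difference of 0.05 units (far below an
    # assumed practical margin of 0.5) can still be statistically significant.
    set.seed(1)
    x <- rnorm(5000, mean = 0.05); y <- rnorm(5000, mean = 0)
    tt <- t.test(x, y)
    tt$p.value                                # may fall below 0.05
    unname(tt$estimate[1] - tt$estimate[2])   # effect estimate (tiny)
    tt$conf.int                               # interval far inside the margin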

112.10.2 Power and Sample Size

The meaning of a non-significant result depends strongly on power.

A non-significant result can mean:

  • the effect is small or absent, or
  • the data are too limited to detect an effect of practical importance.

This is why threshold choice cannot be separated from sample size and power.

A strict confirmatory alpha may be reasonable, but only if the study is sufficiently informative to detect effects that matter. Otherwise, a strict threshold can produce many inconclusive “non-significant” outcomes.
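
Base R's power.t.test() makes this trade-off concrete. The target effect of 0.4 standard deviations and the 90% power goal below are assumed for illustration:

    # Per-group sample size needed to detect an assumed practically
    # important difference of 0.4 SD at alpha = 0.05 with 90% power.
    power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.90)$n

    # Conversely, the power of an n = 20 per-group study for that effect
    # is low (about 0.23), so "non-significant" would be weakly informative.
    power.t.test(n = 20, delta = 0.4, sd = 1, sig.level = 0.05)$power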

112.10.3 Why This Matters Especially for Diagnostics

The interaction between alpha and power is especially important in diagnostic tests.

In many diagnostic settings:

  • \(H_0\) represents a condition we hope is approximately true,
  • but failing to reject \(H_0\) does not prove the assumption,
  • and low power can produce false reassurance.

This is one reason a higher diagnostic alpha (e.g. 10% to 20%) can be reasonable: it reduces the risk of missing relevant violations.

However, increasing alpha is not a substitute for judgment. It should be combined with:

  • graphical diagnostics,
  • effect-size-like summaries (magnitude of deviation),
  • sample-size awareness,
  • and, when appropriate, robust methods.

112.10.4 Practical Implication for Reporting

For a structured reporting format, see Reporting Template.

112.11 Robust Methods and Fragile Diagnostics

In many workflows, diagnostic tests are used to assess assumptions before applying a main inferential method. This can be useful, but it can also lead to poor practice when diagnostics are treated as strict gatekeepers.

A common example is the use of a variance-equality test as a gatekeeper before comparing two means. If the diagnostic test is non-significant, some analysts proceed with the classical equal-variance t-test; if it is significant, they switch methods. This procedure is often unstable because the diagnostic test itself may be low-powered or sensitive to departures from its own assumptions.

For this reason, it is often preferable to use a robust method by default when one is available.

Examples include:

  • Welch’s two-sample t-test instead of the equal-variance t-test,
  • rank-based methods (e.g. Wilcoxon-type procedures) when strong distributional assumptions are doubtful,
  • robust estimators or transformations when appropriate to the scientific objective.

Using a robust method can reduce dependence on fragile diagnostics, but it does not eliminate threshold reasoning.

Threshold decisions still remain, for example:

  • the confirmatory threshold for the main inferential claim,
  • the threshold for any supporting diagnostics that are still reported,
  • the threshold for equivalence decisions (if applicable),
  • and practical thresholds for effect-size interpretation.

In other words:

  • robust methods reduce the risk of choosing the wrong method because a diagnostic gatekeeper was uninformative,
  • but they do not remove the need to justify thresholds by decision context.

112.11.1 Practical Guidance

When a robust method is available and scientifically appropriate:

  1. use the robust method for the main confirmatory analysis (with a clearly justified confirmatory threshold),
  2. treat assumption diagnostics as supporting evidence, not automatic gatekeepers,
  3. report diagnostic results with their role clearly labeled,
  4. avoid interpreting diagnostic non-rejection as proof that assumptions hold exactly.

This approach is often more stable, more transparent, and more consistent with the decision-threshold framework of this chapter.
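
The instability of diagnostic gatekeeping can be checked by simulation. In the sketch below the two groups have equal variances but heavy-tailed (Student t) distributions, which violates the F-test's own normality assumption; the sample sizes, degrees of freedom, and seed are illustrative:

    # How often does an F-test gatekeeper at alpha = 0.05 "switch methods"
    # when variances are truly equal but the data are heavy-tailed?
    set.seed(5)
    false_alarm <- replicate(2000, {
      a <- rt(15, df = 3); b <- rt(15, df = 3)  # equal variances, heavy tails
      var.test(a, b)$p.value < 0.05
    })
    mean(false_alarm)  # often well above the nominal 0.05

    # Welch by default avoids this fragile switching step entirely:
    # t.test(a, b, var.equal = FALSE)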

112.12 Worked Example: One Tool, Two Purposes (QQ Bands and Bootstrap)

This example uses the same data and the same graphical tools for two different purposes:

  • diagnostic screening (assumption check),
  • exploratory/selection (screening for heavy tails / excess kurtosis).

The tools are:

  • the ML fitting / QQ app: https://shiny.wessa.net/fitdistrnorm/
  • the bootstrap central-tendency app: https://shiny.wessa.net/bootstrap/

The key point is that the observed data stay the same, but the threshold (confidence level / alpha) changes because the decision purpose changes.

112.12.1 The Tool

Suppose we inspect a QQ plot with confidence bands for a univariate dataset using the ML fitting / QQ workflow (the same style of plot produced by car::qqPlot()).

Let the confidence bands be interpreted as a threshold device for visual diagnostics:

  • narrower bands (higher \(\alpha\), lower confidence) are more sensitive and raise more alarms,
  • wider bands (lower \(\alpha\), higher confidence) are more conservative and raise fewer alarms.

This is the same threshold logic used earlier in classification: a stricter or more permissive threshold changes the balance between false alarms and missed detections.

112.12.2 Dataset and Setup (same in both scenarios)

Use the same univariate series in both runs. For a concrete example, use the DAX log-returns dataset (derived from EuStockMarkets) that comes preloaded in the embedded app below.

For the QQ workflow in the ML fitting app:

  • Density Function = normal
  • Trimming = 0
  • keep the same Sample Range
  • inspect the QQ plot against the Normal distribution

For the bootstrap workflow (same series):

  • use the same sample range
  • keep trimming fixed
  • inspect the bootstrap distributions and notched boxplots of central tendency

In the embedded example below (DAX daily log returns), the sample size is:

  • \(n = 300\)

The following app is preloaded for Scenario A (diagnostic context, 82% QQ bands, normal reference, no trimming).
To reproduce Scenario B, keep the same data/settings and change QQ band confidence to 0.95.

[Embedded interactive Shiny app: https://shiny.wessa.net/fitdistrnorm/]
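
For readers who prefer to reproduce the setup offline, the sketch below approximates it in R with the built-in EuStockMarkets data and car::qqPlot(); the app's exact band construction may differ slightly, so treat the output as indicative:

    # First 300 DAX daily log returns (built-in EuStockMarkets data).
    x <- diff(log(EuStockMarkets[, "DAX"]))[1:300]
    z <- (x - mean(x)) / sd(x)
    c(n = length(x), skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)

    # Role-specific QQ band levels (requires the car package).
    library(car)
    qqPlot(x, distribution = "norm", envelope = 0.82)  # Scenario A: diagnostic
    qqPlot(x, distribution = "norm", envelope = 0.95)  # Scenario B: selection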

112.12.3 Follow-up Task: Contrast Confirmatory and Diagnostic Settings

Because the full untrimmed series has very heavy tails, a second task helps isolate the threshold effect more clearly.

Using the same embedded app and the same data, set:

  1. Sample Range = [1, 101]
  2. Trimming = 0.04
  3. Density Function = normal

Then compare two threshold contexts on this filtered sample:

  • Confirmatory setting: QQ band confidence = 0.992 (99.2%)
  • Diagnostic setting: QQ band confidence = 0.802 (80.2%)

Observed behavior on this setup:

  • After trimming and shortening the sample, the effective size is about n = 91
  • Tail severity drops strongly (excess kurtosis moves from very large values in the untrimmed case to around -0.42 on this filtered subset)
  • At 99.2%, essentially no points fall outside the QQ bands
  • At 80.2%, many points are flagged (here: 47 outside, mostly in the upper tail)

Interpretation:

  • This is not a contradiction; it is the intended lesson.
  • Trimming and the shorter window change the distributional structure of the sample, and changing the QQ threshold changes how sensitive the diagnostic is.
  • A confirmatory setting can look “acceptable” while a diagnostic setting still flags deviations worth further investigation.

112.12.4 Scenario A: Diagnostic Normality Screening (Assumption Check)

Suppose the QQ plot is used as a diagnostic before applying a method that is sensitive to non-normality (or before deciding whether a transformation or a robust alternative should be considered).

  • Role: Diagnostic
  • Purpose: Screen for meaningful departures from Normality
  • Typical null-like idea: “Normality is a reasonable approximation”
  • Main risk: False reassurance (missing a relevant violation)

In this setting, a low-powered diagnostic with a very strict threshold can be misleading. We may fail to detect an important violation simply because the diagnostic is not sensitive enough.

Therefore, it can be reasonable to use a higher diagnostic alpha (equivalently, a lower confidence level for the QQ bands), for example:

  • diagnostic QQ bands at 80% to 90% confidence (i.e. \(\alpha \approx 0.20\) to \(0.10\))

Observed diagnostic summary (DAX log returns; 82% QQ bands):

  • confidence level: 82%
  • number/pattern of points outside the bands: 121 points outside (17 below, 104 above); departures are concentrated in the tails, with much stronger upper-tail deviation
  • visible pattern (tails / curvature / skew): clear nonlinearity with tail departures, especially on the upper tail, consistent with strong non-Normal behavior
  • supporting summary (e.g. skewness / kurtosis / bootstrap shape): skewness ≈ -3.131 and excess kurtosis ≈ 38.997 (kurtosis ≈ 41.997)

Diagnostic interpretation:

  • At an 82% diagnostic threshold, the QQ plot does raise a warning about meaningful deviation from Normality.
  • This is a screening conclusion, not proof.

112.12.5 Scenario B: Exploratory / Selection (Evidence of Heavy Tails)

Now use the same QQ plot for a different purpose.

Suppose we are screening financial assets and we are specifically interested in assets whose log returns show evidence of heavy tails / excess kurtosis (because that is relevant to a downstream strategy or risk model).

  • Role: Exploratory / Selection (possibly confirmatory later)
  • Purpose: Select an asset with convincing evidence of tail behavior
  • Typical null-like idea: “The returns are approximately Normal”
  • Main risk: False selection (claiming heavy tails too easily)

Here the decision cost is different. A false alarm may cause us to select an asset for the wrong reason. Therefore, we may prefer a stricter threshold (equivalently, a higher confidence level for the QQ bands), for example:

  • selection QQ bands at 95% to 99% confidence (i.e. \(\alpha \approx 0.05\) to \(0.01\))

Now the evidence required before claiming meaningful tail departures is stronger.

Observed selection summary (same DAX log returns; 95% QQ bands):

  • confidence level: 95%
  • number/pattern of points outside the bands: 62 points outside (7 below, 55 above); fewer alarms than at 82%, but substantial tail departures remain
  • visible tail behavior relative to 82% case: the qualitative tail pattern remains, with fewer flagged points under wider bands; the signal is more conservative but still strong
  • supporting summary (e.g. bootstrap stability of tail-sensitive statistics): the same series still shows strong asymmetry and heavy tails (skewness ≈ -3.131, excess kurtosis ≈ 38.997), supporting a heavy-tail interpretation

Selection interpretation:

  • If tail departures remain clear at 95% bands, the evidence for heavy tails is more convincing for selection.
  • If the apparent departures disappear at 95% bands, the evidence may be too weak for the selection purpose.

112.12.6 The Same Data, Different Thresholds, Different Decisions

This is not a contradiction. It is correct statistical reasoning.

The threshold should follow the decision purpose.

What would be wrong is to use the same threshold automatically in both scenarios merely because “5% is standard.”

112.12.7 Why This Example Matters

This example shows three important principles at once:

  1. Alpha (or confidence level) is a decision threshold
  2. Threshold choice depends on role
  3. Non-rejection is not proof, especially in diagnostics

112.12.8 Practical Reporting Pattern for This Example

When reporting results from QQ-band diagnostics (or related bootstrap interval tools), report:

  • the role of the analysis (diagnostic vs selection),
  • the confidence level (or \(\alpha\)) and why it was chosen,
  • the pattern of departures (center, tails, skew, curvature),
  • any supporting summaries (e.g. skewness/kurtosis estimates, bootstrap summaries),
  • and the resulting decision (screening flag, no flag, shortlist candidate, etc.).

112.12.9 Pre-Specification Reminder

The choice of confidence level (or alpha range) should be tied to the role before interpreting the plot.

It is not valid to inspect the QQ plot first and then choose a threshold only because it supports a preferred conclusion.

112.13 Multi-Alpha Reporting Example

A practical way to avoid ritual thinking is to report the same p-value against multiple pre-declared thresholds.

The p-value does not change. The threshold does.

112.13.1 Example (diagnostic tests)

Suppose we have two diagnostic p-values:

  • centered-sample KS shape diagnostic: \(p = 0.7672\)
  • variance F-test diagnostic: \(p = 0.2336\)

Both tests are diagnostic (assumption/screening role), not confirmatory claims.

Test (Role = Diagnostic)           p-value   at 1%            at 5%            at 10%           at 20%
Centered-sample KS (shape check)   0.7672    fail to reject   fail to reject   fail to reject   fail to reject
F-test (equal variances)           0.2336    fail to reject   fail to reject   fail to reject   fail to reject

At first sight, the decisions are unchanged. But the interpretation is not identical.

  • Failing to reject at 20% is a stronger diagnostic non-alarm than failing only at 5%.
  • This matters in diagnostics because the main concern is often false reassurance (Type II error), not only false alarms.

In other words, even when the binary decision does not change, the diagnostic strength of the result can change under a role-appropriate threshold framework.

112.13.2 Threshold-Sensitive Example (Illustrative)

Now consider a diagnostic p-value of:

  • \(p = 0.08\)

Test (Role = Diagnostic)           p-value   at 1%            at 5%            at 10%           at 20%
Diagnostic test (illustrative)     0.08      fail to reject   fail to reject   reject           reject

This makes the threshold logic visible:

  • the data did not change,
  • the p-value did not change,
  • the decision threshold changed,
  • therefore the decision changed.

That is exactly the point of multi-alpha reporting.
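
A small helper makes multi-alpha reporting mechanical. The threshold set below matches the pre-declared 1%/5%/10%/20% grid used in this section:

    # Decisions for one p-value over a pre-declared threshold set.
    multi_alpha <- function(p, alphas = c(0.01, 0.05, 0.10, 0.20)) {
      setNames(ifelse(p <= alphas, "reject H0", "fail to reject H0"),
               paste0("alpha = ", alphas))
    }
    multi_alpha(0.7672)  # KS diagnostic: no alarm at any declared level
    multi_alpha(0.08)    # threshold-sensitive: decision flips between 5% and 10%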

112.13.3 Pre-Specification Requirement

The set of thresholds (for example, 1%, 5%, 10%, 20%) should be declared in advance as part of the reporting design.

Multi-alpha reporting is useful as a transparency tool and sensitivity analysis. It is not a license to choose whichever threshold produces a preferred conclusion after the fact.

112.14 Secondary Example: Variance Diagnostic vs Mean Comparison (Welch)

A common mistake in practice is to use the equal-variance F-test as a strict gatekeeper before comparing two means.

This is exactly the kind of situation where the role of the test matters.

112.14.1 The Two Tests Serve Different Purposes

Suppose we compare two independent groups. There are two distinct questions:

  1. Main research question (confirmatory): Do the group means differ?
  2. Supporting diagnostic question: Is the equal-variance assumption reasonable?

These are not the same decision problem, so they do not need the same threshold.

112.14.2 Role-Based Thresholds

  • Variance F-test
    • Role: Diagnostic
    • Purpose: Screen for evidence against equal variances
    • Typical concern: Avoid false reassurance
    • Reasonable choice: a higher diagnostic alpha (e.g. 10% to 20%)
  • Mean comparison test
    • Role: Confirmatory
    • Purpose: Support the main substantive claim
    • Typical concern: Avoid false positive claims
    • Reasonable choice: a stricter confirmatory alpha (e.g. 1% to 5%)

112.14.3 Example Interpretation

Suppose the equal-variance F-test yields:

  • p-value = 0.2336

Then:

  • at a diagnostic alpha of 20%, we still fail to reject equality of variances,
  • so the sample does not provide strong diagnostic evidence of unequal variances.

This is a stronger diagnostic non-alarm than failing only at 5%, but it is still not proof that the population variances are equal.
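
In R, the two roles map onto two separate calls; the simulated groups below are illustrative stand-ins for the two samples:

    # Diagnostic role: variance F-test, compared to a diagnostic alpha of 0.20.
    set.seed(3)
    g1 <- rnorm(40, mean = 10, sd = 2)
    g2 <- rnorm(40, mean = 11, sd = 2)
    var.test(g1, g2)$p.value        # compare to 0.20, not automatically 0.05

    # Confirmatory role: Welch test for the main claim at a stricter alpha.
    t.test(g1, g2, var.equal = FALSE)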

112.14.4 Why Welch Often Remains the Better Default

Even when the variance diagnostic does not raise an alarm, it is often reasonable to use Welch’s two-sample test for the mean comparison because:

  • it does not require the equal-variance assumption,
  • it is robust in many practical settings,
  • and it reduces dependence on a low-power diagnostic gatekeeping step.

This illustrates an important principle:

  • Robust methods can reduce dependence on fragile diagnostics, but they do not eliminate threshold reasoning.

112.14.5 Recommended Reporting Style (This Case)

Report the two decisions separately:

  • Diagnostic (variance check):
    “At a diagnostic significance level of 20%, the variance F-test (p = 0.2336) does not provide strong evidence against equal variances. This is not proof of equality.”

  • Confirmatory (mean comparison):
    “The main group comparison was conducted using Welch’s two-sample test at a confirmatory significance level of 5%, with effect size and confidence interval reported.”

This avoids the common error of treating the variance diagnostic as a strict gatekeeper.

112.15 Secondary Example: Equivalence Testing (TOST)

This section is a threshold-logic illustration, not a full tutorial on the TOST procedure.

Its purpose is to show why equivalence is a different decision role from ordinary difference testing, and why threshold choice must follow that role. The full method, assumptions, implementation details, and interpretation are presented in the dedicated TOST chapter (Chapter 120).

Equivalence testing is one of the clearest examples of why threshold choice depends on the decision problem.

112.15.1 Why a Usual Non-Significant Difference Test Is Not Enough

Suppose a usual two-sided difference test yields:

  • p-value = 0.18 (so we fail to reject “no difference”)

This does not show equivalence.

It only shows that the data do not provide strong enough evidence of a difference at the chosen threshold.

A non-significant result can occur because:

  • the true difference is practically negligible, or
  • the sample is too small, or
  • the data are too variable.

Therefore:

  • Non-rejection of a difference test is not evidence of equivalence.

112.15.2 The Equivalence Question Is Different

In equivalence testing, the goal is not to prove the difference is exactly zero. The goal is to show it is small enough to be practically unimportant.

This requires a pre-specified equivalence margin, for example:

  • \(\Delta_L = -2\) and \(\Delta_U = 2\)

meaning that differences between -2 and +2 are treated as practically negligible.

112.15.3 Role and Threshold Logic

  • Role: Equivalence
  • Purpose: Demonstrate practical similarity within justified bounds
  • Primary risk: Incorrectly declaring equivalence when the true difference is meaningfully large
  • Threshold choice: Must be tied to the equivalence design and reported explicitly

This is a different role from confirmatory difference testing and should not inherit its threshold automatically.

112.15.4 TOST Threshold Logic (Illustration Only)

The short summary below is included only to explain the threshold/interval mapping used in this chapter’s equivalence example.

For the full statistical procedure (including assumptions, implementation, and reporting details), see the dedicated TOST chapter (Chapter 120).

TOST evaluates two one-sided null hypotheses:

  • \(H_{0L}: \delta \le \Delta_L\)
  • \(H_{0U}: \delta \ge \Delta_U\)

and declares equivalence only if both are rejected.

If the chosen one-sided significance level is \(\alpha\), then the corresponding TOST confidence interval has level:

  • \(1 - 2\alpha\)

So, for example:

  • one-sided \(\alpha = 0.05\) corresponds to a 90% TOST CI
  • one-sided \(\alpha = 0.025\) corresponds to a 95% TOST CI

This is another reason threshold choice must be interpreted in context: the CI mapping differs from the usual two-sided difference-testing convention.
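
A minimal sketch of this logic for two independent samples, using one-sided Welch tests; the simulated data, the bounds \([-2, 2]\), and the one-sided \(\alpha = 0.05\) follow this chapter's illustrative numbers:

    # TOST as two one-sided Welch tests plus the matching 1 - 2*alpha CI.
    set.seed(9)
    x <- rnorm(60, mean = 100.6, sd = 4)
    y <- rnorm(60, mean = 100.0, sd = 4)
    alpha <- 0.05; dL <- -2; dU <- 2
    pL <- t.test(x, y, mu = dL, alternative = "greater")$p.value  # H0L: delta <= dL
    pU <- t.test(x, y, mu = dU, alternative = "less")$p.value     # H0U: delta >= dU
    max(pL, pU) < alpha                                # equivalence iff both rejected
    t.test(x, y, conf.level = 1 - 2 * alpha)$conf.int  # 90% TOST interval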

112.15.5 Conceptual Example

Suppose the estimated mean difference is:

  • \(\hat{\delta} = 0.6\)

with equivalence bounds:

  • \([-2, 2]\)

and the 90% TOST CI is:

  • \([-1.1, 1.8]\)

Because the entire TOST CI lies inside the equivalence bounds, the data support equivalence at the chosen threshold.

This is a stronger and more appropriate conclusion than saying:

  • “the ordinary two-sided test was non-significant.”

112.15.6 Recommended Reporting Style (Equivalence)

Report all of the following:

  • Role: Equivalence
  • Equivalence bounds: and why they are practically justified
  • Chosen one-sided alpha: and the implied TOST CI level
  • TOST results: both one-sided p-values
  • Effect estimate: and interval
  • Conclusion: equivalent / not equivalent (within the stated bounds)

Do not report equivalence solely because an ordinary difference test failed to reject.

For a complete treatment of equivalence testing (including model assumptions, parametric and nonparametric variants, and worked computations), see Chapter 120.

112.16 A 5-Question Rule for Choosing a Threshold

Before computing or interpreting a test result, answer these questions:

  1. What is the role of this procedure?

    • Confirmatory / Diagnostic / Exploratory-Selection / Equivalence
  2. What does \(H_0\) represent here?

    • a claim to challenge?
    • an assumption to screen?
    • a benchmark?
    • an equivalence boundary?
  3. Which error is more costly in this context?

    • Type I (false alarm / false claim)?
    • Type II (missed effect / missed violation)?
  4. What effect size is practically important?
    This is a substantive judgment, not a statistical one. It requires domain knowledge about what magnitude would change a recommendation, policy, or downstream decision.

  5. Is the sample size large enough for the intended interpretation (power)?

Only then choose the threshold (or threshold range).

112.17 Reporting Template

When reporting a test or threshold-based diagnostic, include:

  • Role of the procedure (confirmatory / diagnostic / exploratory-selection / equivalence)
  • Chosen threshold (alpha, confidence level, or cutoff) and why
  • p-value (if applicable)
  • Effect estimate and practical significance
  • Interval estimate (confidence / credible / diagnostic band description as relevant)
  • Power/sample-size caution (when relevant)
  • Interpretation wording that matches the role
    • especially: avoid treating non-rejection as proof of \(H_0\)

112.18 Summary Principles

The main principles of this chapter are:

  1. Alpha is a decision threshold, not a ritual constant.
  2. The role of the procedure determines threshold logic.
    • Confirmatory, diagnostic, exploratory/selection, and equivalence are different decision problems.
  3. Threshold choice depends on error costs.
  4. Non-rejection does not prove the null hypothesis.
    • Especially in diagnostics.
  5. Confidence levels are thresholds too.
  6. Effect size, practical significance, and power must be considered together with alpha.
  7. Robust methods can reduce dependence on fragile diagnostics, but do not eliminate threshold reasoning.
  8. Pre-specify the threshold (or threshold range) by role before interpreting results.
  9. Report the threshold and the rationale.
  10. There is no universal correct alpha for all tests.
    • There is only a threshold that is more or less appropriate for the decision problem at hand.

These principles apply throughout this handbook, from early classification threshold decisions to diagnostic checks, confirmatory hypothesis tests, equivalence testing, and confidence-interval interpretation.

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567 (7748): 305–7. https://doi.org/10.1038/d41586-019-00857-9.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (sup1): 1–19. https://doi.org/10.1080/00031305.2019.1583913.

  1. For foundational guidance on p-values and threshold interpretation, see (Wasserstein and Lazar 2016; Wasserstein, Schirm, and Lazar 2019; Amrhein, Greenland, and McShane 2019).

