16 Hypergeometric Distribution

16.1 Definition

Suppose a finite population has size \(N\), with \(M\) “success” items and \(N-M\) “failure” items. If we draw \(n\) items without replacement, and let \(X\) be the number of successes in the sample, then:

\[ X \sim \text{Hypergeom}(N,M,n) \]

with probability mass function

\[ \text{P}(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} \]

for

\[ \max(0, n-(N-M)) \le x \le \min(n,M). \]
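The definition above can be checked numerically. The following is a minimal sketch, assuming illustrative values \(N=20\), \(M=7\), \(n=5\) that are not taken from the text. It evaluates the mass function from the binomial-coefficient formula and compares it with base R's `dhyper()`, whose arguments are `m` (successes in the population), `n` (failures in the population, i.e. \(N-M\)), and `k` (sample size).

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

# Support bounds from the definition
x_min <- max(0, n - (N - M))
x_max <- min(n, M)
x <- x_min:x_max

# PMF from the binomial-coefficient formula
pmf_manual <- choose(M, x) * choose(N - M, n - x) / choose(N, n)

# Same PMF from base R; note that dhyper()'s `n` is the failure count N - M
pmf_builtin <- dhyper(x, m = M, n = N - M, k = n)

print(data.frame(x, pmf_manual, pmf_builtin))
cat("Sum of probabilities:", sum(pmf_manual), "\n")
```

Note that the handbook's \(n\) (sample size) corresponds to `k` in R, while R's `n` argument is the number of failures. Mixing these up is a common source of errors.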

Parameter summary:

Symbol	Meaning
\(N\)	population size
\(M\)	number of successes in the population
\(n\)	sample size (draws without replacement)
\(X\)	successes observed in the sample

16.2 Mean

\[ \text{E}(X) = n\frac{M}{N} \]

16.3 Variance

\[ \text{V}(X) = n\frac{M}{N}\left(1-\frac{M}{N}\right)\frac{N-n}{N-1} \]

The factor \(\frac{N-n}{N-1}\) is the finite-population correction (FPC). It is strictly less than 1 when \(n>1\), which reduces variance relative to a binomial model with the same nominal success probability \(M/N\). Intuitively, sampling without replacement induces negative dependence among draws.
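As a quick check of the mean and variance formulas, the sketch below computes both moments directly from the mass function and compares them with the closed forms. The parameter values are illustrative assumptions, not values from the text.

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

x <- max(0, n - (N - M)):min(n, M)
p <- dhyper(x, m = M, n = N - M, k = n)

# Moments computed directly from the PMF
mean_num <- sum(x * p)
var_num  <- sum((x - mean_num)^2 * p)

# Closed-form expressions
mean_formula <- n * M / N
fpc          <- (N - n) / (N - 1)              # finite-population correction
var_binom    <- n * (M / N) * (1 - M / N)      # binomial variance with p = M/N
var_formula  <- var_binom * fpc

cat("Mean (numeric, formula):    ", mean_num, mean_formula, "\n")
cat("Variance (numeric, formula):", var_num, var_formula, "\n")
cat("FPC:", fpc, " binomial variance:", var_binom, "\n")
```

Because the FPC is below 1 here, the hypergeometric variance is smaller than the binomial variance with the same \(M/N\).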

16.4 Mode

At least one mode is

\[ \text{Mo}(X)=\left\lfloor \frac{(n+1)(M+1)}{N+2}\right\rfloor. \]

If \(c=\frac{(n+1)(M+1)}{N+2}\) is an integer, then \(\text{P}(X=c-1)=\text{P}(X=c)\), so both \(c-1\) and \(c\) are modes.
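The mode formula can be compared with the location of the largest probability. The sketch below uses illustrative values (an assumption, not from the text), and also shows one tie case, \(N=10\), \(M=3\), \(n=2\), where \(\frac{(n+1)(M+1)}{N+2}=1\) is an integer.

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

x <- max(0, n - (N - M)):min(n, M)
p <- dhyper(x, m = M, n = N - M, k = n)

mode_formula <- floor((n + 1) * (M + 1) / (N + 2))
mode_numeric <- x[which.max(p)]   # first value with maximal probability
cat("Mode (formula, numeric):", mode_formula, mode_numeric, "\n")

# Tie case: (n + 1)(M + 1) / (N + 2) = 3 * 4 / 12 = 1 is an integer,
# so x = 0 and x = 1 carry the same probability
tie_probs <- dhyper(0:1, m = 3, n = 10 - 3, k = 2)
print(tie_probs)
```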

16.5 Median

No simple closed-form expression is available in general; the median is typically obtained numerically from the CDF.
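In R, one way to obtain the median numerically is the quantile function `qhyper()`, which returns the smallest \(x\) with \(\text{P}(X \le x) \ge p\). The sketch below uses illustrative parameter values that are not from the text.

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

# Median as the 0.5 quantile: smallest x with P(X <= x) >= 0.5
med <- qhyper(0.5, m = M, n = N - M, k = n)

# Check against the CDF on either side of the median
cdf_at_med    <- phyper(med,     m = M, n = N - M, k = n)
cdf_below_med <- phyper(med - 1, m = M, n = N - M, k = n)

cat("Median:", med, "\n")
cat("P(X <= median) =", cdf_at_med, " P(X <= median - 1) =", cdf_below_med, "\n")
```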

16.6 Moment Generating Function

Using the finite support of \(X\), the MGF can be written as

\[ M_X(t)=\sum_{x=\max(0,n-(N-M))}^{\min(n,M)} e^{tx}\,\text{P}(X=x). \]
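Because the sum is finite, the MGF is easy to evaluate numerically. The sketch below defines a helper, `mgf_hyper()` (a hypothetical name introduced here for illustration), and checks two basic properties with assumed parameter values: \(M_X(0)=1\), and the derivative at \(t=0\) equals the mean \(nM/N\). The derivative is approximated with a central finite difference.

```r
# Hypothetical helper: evaluates the MGF as a finite sum over the support
mgf_hyper <- function(t, N, M, n) {
  x <- max(0, n - (N - M)):min(n, M)
  sum(exp(t * x) * dhyper(x, m = M, n = N - M, k = n))
}

# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

mgf_at_zero <- mgf_hyper(0, N, M, n)

# Central finite difference approximating M_X'(0), which equals E(X)
h <- 1e-5
deriv_at_zero <- (mgf_hyper(h, N, M, n) - mgf_hyper(-h, N, M, n)) / (2 * h)

cat("M_X(0) =", mgf_at_zero, "\n")
cat("M_X'(0) ~", deriv_at_zero, " E(X) =", n * M / N, "\n")
```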

16.7 Coefficient of Skewness

\[ g_1=\frac{(N-2M)(N-2n)\sqrt{N-1}}{(N-2)\sqrt{nM(N-M)(N-n)}}. \]
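The skewness formula can be verified against the standardized third moment computed from the mass function. The parameter values below are illustrative assumptions, not values from the text.

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

x <- max(0, n - (N - M)):min(n, M)
p <- dhyper(x, m = M, n = N - M, k = n)

mu    <- sum(x * p)
sigma <- sqrt(sum((x - mu)^2 * p))

# Standardized third central moment from the PMF
skew_num <- sum(((x - mu) / sigma)^3 * p)

# Closed-form coefficient of skewness
skew_formula <- (N - 2 * M) * (N - 2 * n) * sqrt(N - 1) /
  ((N - 2) * sqrt(n * M * (N - M) * (N - n)))

cat("Skewness (numeric, formula):", skew_num, skew_formula, "\n")
```

The formula also shows that the distribution is symmetric (\(g_1 = 0\)) when \(M = N/2\) or \(n = N/2\).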

16.8 Coefficient of Kurtosis

\[ g_2 = 3 + \frac{(N-1)N^2\left[N(N+1)-6M(N-M)-6n(N-n)\right] + 6nM(N-M)(N-n)(5N-6)}{nM(N-M)(N-n)(N-2)(N-3)}. \]

The corresponding excess kurtosis is

\[ g_2 - 3 = \frac{(N-1)N^2\left[N(N+1)-6M(N-M)-6n(N-n)\right] + 6nM(N-M)(N-n)(5N-6)}{nM(N-M)(N-n)(N-2)(N-3)}. \]
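In the same way, the kurtosis expression can be compared with the standardized fourth moment computed from the mass function. This is a sketch with assumed illustrative values that are not from the text.

```r
# Hypothetical illustration values (not from the text)
N <- 20; M <- 7; n <- 5

x <- max(0, n - (N - M)):min(n, M)
p <- dhyper(x, m = M, n = N - M, k = n)

mu    <- sum(x * p)
sigma <- sqrt(sum((x - mu)^2 * p))

# Standardized fourth central moment (kurtosis g_2) from the PMF
kurt_num <- sum(((x - mu) / sigma)^4 * p)

# Closed-form excess kurtosis g_2 - 3
excess_formula <-
  ((N - 1) * N^2 * (N * (N + 1) - 6 * M * (N - M) - 6 * n * (N - n)) +
     6 * n * M * (N - M) * (N - n) * (5 * N - 6)) /
  (n * M * (N - M) * (N - n) * (N - 2) * (N - 3))

cat("Kurtosis g_2 (numeric):", kurt_num, "\n")
cat("Excess kurtosis (numeric, formula):", kurt_num - 3, excess_formula, "\n")
```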

16.9 Purpose

The hypergeometric distribution is the default model for sampling without replacement:

Audit and compliance sampling: number of problematic records found in a fixed audit sample.
Quality control: defect counts when inspecting items from a finite lot.
Inventory and logistics checks: category counts from finite stock pulls.
Lot acceptance sampling: classical quality-control acceptance/rejection decisions for finite lots.
Relation to binomial: when the sample fraction \(n/N\) is small (rule of thumb: \(n/N < 0.05\)), hypergeometric probabilities are close to binomial probabilities with success probability \(M/N\) (Chapter 13). The hypergeometric is always the exact finite-population model; for larger sample fractions the binomial approximation degrades and the hypergeometric should be used.
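The effect of the sample fraction on the binomial approximation can be illustrated numerically. The sketch below defines a helper, `max_abs_diff()` (a hypothetical name introduced here), which returns the largest absolute difference between the hypergeometric and binomial mass functions. The two parameter sets are illustrative assumptions: one with \(n/N = 0.02\) and one with \(n/N = 0.40\).

```r
# Hypothetical helper: largest pointwise gap between the two PMFs
max_abs_diff <- function(N, M, n) {
  x <- 0:n
  hyper <- dhyper(x, m = M, n = N - M, k = n)   # exact finite-population model
  binom <- dbinom(x, size = n, prob = M / N)    # approximation with p = M/N
  max(abs(hyper - binom))
}

# Hypothetical illustration values (not from the text)
small_fraction <- max_abs_diff(N = 1000, M = 300, n = 20)    # n/N = 0.02
large_fraction <- max_abs_diff(N = 1000, M = 300, n = 400)   # n/N = 0.40

cat("Max |hypergeometric - binomial|, n/N = 0.02:", small_fraction, "\n")
cat("Max |hypergeometric - binomial|, n/N = 0.40:", large_fraction, "\n")
```

The gap is noticeably larger in the second case, where the finite-population correction is far from 1.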

16.10 R Module

The Hypergeometric Probabilities app is available in the handbook menu:

Distributions / Hypergeometric Probabilities

It is also accessible directly at:

https://shiny.wessa.net/hypergeometric/

16.11 Business Example: Internal Audit Sampling

An organization has \(N = 1200\) procurement records for a quarter. Based on risk profiling, \(M = 90\) are flagged as high-risk. An internal audit samples \(n = 80\) records without replacement.

Let \(X\) be the number of high-risk records in the sample. A key escalation metric is:

\[ \text{P}(X \ge 10) \]

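# Population of 1200 records, 90 flagged high-risk, 80 sampled
N <- 1200
M <- 90
n <- 80

# phyper(q, m, n, k): m = successes, n = failures (N - M), k = sample size
# Upper tail: P(X >= 10) = 1 - P(X <= 9)
cat("P(X >= 10) =", 1 - phyper(9, m = M, n = N - M, k = n), "\n")
# Interval: P(5 <= X <= 12) = P(X <= 12) - P(X <= 4)
cat("P(5 <= X <= 12) =", phyper(12, m = M, n = N - M, k = n) - phyper(4, m = M, n = N - M, k = n), "\n")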

P(X >= 10) = 0.06893187 
P(5 <= X <= 12) = 0.7297473
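To put the escalation threshold in context, the expected count and standard deviation follow from the formulas in Sections 16.2 and 16.3. The sketch below computes them for this audit setup, and also shows an equivalent way to obtain the upper tail with `lower.tail = FALSE`, which avoids the explicit subtraction from 1.

```r
N <- 1200
M <- 90
n <- 80

# Expected number of high-risk records in the sample: n * M / N
mean_x <- n * M / N

# Standard deviation, including the finite-population correction
sd_x <- sqrt(n * (M / N) * (1 - M / N) * (N - n) / (N - 1))

# P(X >= 10) = P(X > 9), computed directly as an upper tail
p_upper <- phyper(9, m = M, n = N - M, k = n, lower.tail = FALSE)

cat("E(X) =", mean_x, " SD(X) =", sd_x, "\n")
cat("P(X >= 10) =", p_upper, "\n")
```

The expected count is 6, so a threshold of 10 lies well above the mean, which is consistent with the small tail probability.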

You can reproduce this setup in the Hypergeometric Probabilities app (Section 16.10).

16.12 Additional Academic Example: Ecology Field Sampling

In a conservation study, a habitat has \(N=500\) tagged plants, of which \(M=80\) belong to a rare species.
A team samples \(n=40\) plants without replacement and records the number \(X\) of rare plants.

A monitoring question is:

\[ \text{P}(X \ge 10). \]

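# 500 tagged plants, 80 of the rare species, 40 sampled
N_field <- 500
M_rare <- 80
n_sample <- 40

# Upper tail: P(X >= 10) = 1 - P(X <= 9); phyper's n argument is the count of non-rare plants
cat("P(X >= 10) =",
    1 - phyper(9, m = M_rare, n = N_field - M_rare, k = n_sample), "\n")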

P(X >= 10) = 0.08627864
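As a sanity check, the exact tail probability can be compared with a Monte Carlo estimate based on `rhyper()`. This is a sketch only: the seed and the number of simulated samples are arbitrary choices made here, not part of the original example.

```r
set.seed(123)          # arbitrary seed, chosen here for reproducibility

N_field  <- 500
M_rare   <- 80
n_sample <- 40
n_sims   <- 100000     # number of simulated field samples (assumption)

# Each draw is the number of rare plants in one simulated sample of 40
draws <- rhyper(n_sims, m = M_rare, n = N_field - M_rare, k = n_sample)

p_sim   <- mean(draws >= 10)
p_exact <- phyper(9, m = M_rare, n = N_field - M_rare, k = n_sample,
                  lower.tail = FALSE)

cat("Simulated P(X >= 10) =", p_sim, "\n")
cat("Exact     P(X >= 10) =", p_exact, "\n")
```

The simulated value should agree with the exact probability up to Monte Carlo error, which shrinks as `n_sims` grows.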