Introduction to Probability Distributions

What to Use When — and How

Session roadmap

Observe your data
        ↓
Is it discrete (counts) or continuous (measurements)?
        ↓                               ↓
Discrete: Bernoulli, Binomial,    Continuous: Uniform, Normal
          Geometric, Neg. Binomial,       (t, χ², F → hypothesis testing)
          Hypergeometric, Poisson,
          Multinomial

One thread: identify the process that generated the data, then choose the distribution.

Pre-reading check

Hands up: true or false?

  1. “Counting events in a fixed time window at a constant rate → use the Binomial distribution”    false

  2. “The Poisson distribution has the property that its mean equals its variance”    true

  3. “The Normal distribution extends to ±∞ and has no finite maximum”    true

What is a probability distribution?

A probability distribution is a complete description of all possible outcomes and their probabilities.

  • Discrete distributions assign probability to countable outcomes: P(X = k) for k = 0, 1, 2, …
  • Continuous distributions assign probability to intervals: P(a ≤ X ≤ b) = area under f(x)

Two things to always specify:

  1. Support — what values can X take?
  2. Parameters — which numbers control the shape?

Every distribution in today’s session answers: “what is the data-generating process?”

How to choose a distribution

When you observe… Distribution Parameters
One trial, binary outcome Bernoulli p
n fixed trials, count successes Binomial n, p
Count failures before 1st success Geometric p
Count failures before r-th success Negative Binomial r, p
Sample n items from N without replacement Hypergeometric N, M, n
Count events in fixed time/space at rate λ Poisson λ
Count items in K ≥ 3 categories Multinomial n, p₁…p_K
Equal probability over interval [a, b] Uniform a, b
Measurement or sum of many independent factors Normal μ, σ²

Bernoulli — the building block

The Bernoulli distribution is the simplest possible: one trial, two outcomes.

\[P(X = 1) = p \quad\quad P(X = 0) = 1 - p\]

  • X = 1 (“success”), X = 0 (“failure”)
  • Single parameter: p = P(success)
  • Mean = p; Variance = p(1−p)

Every Binomial trial is a Bernoulli trial. Binomial(n, p) = sum of n independent Bernoulli(p) variables.
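This identity is easy to check by simulation. A minimal sketch (standard library only; the sample size of 10,000 is an illustrative choice):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n, p = 45, 0.5
reps = 10_000

# Each sample: sum of n independent Bernoulli(p) draws -> one Binomial(n, p) value
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]

mean = sum(samples) / reps
var = sum((x - mean) ** 2 for x in samples) / reps

# Theory: mean = np = 22.5, variance = np(1-p) = 11.25
print(round(mean, 2), round(var, 2))
```

The simulated mean and variance land close to np and np(1−p), as the identity predicts.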

Story: Binomial — hospital births

Setting: A large maternity hospital records an average of 45 births per day. Historically, about 50% of births are recorded as male.

Question: On a given day with 45 births, what is the probability of at most 27 males?

Why Binomial?

  • Fixed number of trials: n = 45
  • Each birth independently classified as male or not: binary outcome
  • Constant probability: p = 0.5

\[X \sim \text{Binomial}(n = 45,\ p = 0.5)\]

Handbook result: P(X ≤ 27) ≈ 0.9324
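A quick sanity check of the handbook value, summing the PMF directly with Python's standard library:

```python
from math import comb

# Exact Binomial CDF: P(X <= 27) for X ~ Binomial(45, 0.5)
n, p = 45, 0.5
p_at_most_27 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(28))

print(round(p_at_most_27, 4))  # near 0.932
```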

Binomial — annotated PMF

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

Symbol Meaning
n number of trials
k number of successes (the outcome we’re asking about)
p P(success) per trial
(1−p) P(failure) per trial
\(\binom{n}{k}\) number of ways to arrange k successes in n trials

Mean = np     Variance = np(1−p)

Variance is maximal when p = 0.5 and approaches 0 as p → 0 or p → 1.

Binomial app

Task

  1. Set n=45, p=0.5. Compute P(X≤27). Does the result match the handbook?
  2. Increase n to 200, keep p=0.5. How does the shape change?
  3. Set p=0.05, n=100. What does the distribution look like? (Compare to Poisson with λ=5.)
  4. Predict: what happens to variance when p=0.5 vs p=0.1? Verify.
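Task 3 can also be checked numerically: with large n and small p, the Binomial(n, p) PMF is well approximated by Poisson(λ = np). A short sketch:

```python
from math import comb, exp, factorial

# Task 3: Binomial(n=100, p=0.05) vs Poisson(lambda = np = 5)
n, p = 100, 0.05
lam = n * p

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# The two PMFs nearly coincide across the relevant range of counts
max_diff = max(abs(binom_pmf(k) - pois_pmf(k)) for k in range(21))
print(round(max_diff, 4))
```

The largest pointwise gap between the two PMFs is below 0.01, which is why the shapes look identical in the app.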

Geometric and Negative Binomial

Geometric — count failures before the first success

\[P(X = k) = (1-p)^k \cdot p \quad\quad k = 0, 1, 2, \ldots\]

  • Example: a B2B sales rep closes deals with p = 0.18. How many failed calls before the first sale?
  • Mean = (1−p)/p; Variance = (1−p)/p²
  • Memoryless: past failures carry no information about future success
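Both the mean formula and memorylessness can be verified directly from the PMF; a sketch using the sales-rep example (p = 0.18):

```python
p = 0.18  # per-call probability of closing a deal

# P(k failed calls before the first sale)
def geom_pmf(k):
    return (1 - p) ** k * p

# Mean via a truncated sum; compare to the closed form (1-p)/p = 4.5556
mean = sum(k * geom_pmf(k) for k in range(2000))

# Memorylessness: P(X >= m + j | X >= m) = P(X >= j), since P(X >= k) = (1-p)^k
def surv(k):
    return (1 - p) ** k

cond = surv(10 + 3) / surv(10)  # given 10 failures already, P of 3 more
print(round(mean, 4), round(cond, 6), round(surv(3), 6))
```

The conditional probability equals the unconditional one: past failures carry no information.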

Negative Binomial — count failures before the r-th success

\[P(X = k) = \binom{k+r-1}{k}(1-p)^k p^r\]

  • Example: sales pipeline targeting r = 6 closed deals, p = 0.25. Geometric is the special case r = 1.
  • Used for overdispersed count data (variance > mean)
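Two properties worth confirming in code: setting r = 1 recovers the Geometric PMF, and the variance r(1−p)/p² always exceeds the mean r(1−p)/p. A sketch with the pipeline example (r = 6, p = 0.25):

```python
from math import comb

p, r = 0.25, 6  # sales pipeline example

def nb_pmf(k, r):
    return comb(k + r - 1, k) * (1 - p) ** k * p ** r

# r = 1 collapses the binomial coefficient to 1 -> Geometric PMF
geom_match = all(abs(nb_pmf(k, 1) - (1 - p) ** k * p) < 1e-12 for k in range(50))

mean = r * (1 - p) / p        # 18.0 failed attempts on average
var = r * (1 - p) / p ** 2    # 72.0 -> variance > mean (overdispersion)
print(geom_match, mean, var)
```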

Hypergeometric — sampling without replacement

When to use: sampling n items from a finite population of N, where M items have the property of interest, without replacement.

\[P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}\]

Symbol Meaning
N total population size
M number in population with the property
n sample size
k number in sample with the property

Example: An auditor samples n = 80 invoices from N = 1,200, of which M = 90 are flagged for errors. What is P(X ≥ 5)?

Key contrast with Binomial: Binomial assumes sampling with replacement (or infinite population). Use Hypergeometric when n/N ≥ 0.1.
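The auditor's tail probability can be computed exactly from the PMF; a sketch (standard library only):

```python
from math import comb

N, M, n = 1200, 90, 80  # auditor example: population, flagged invoices, sample

def hyper_pmf(k):
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# P(X >= 5) = 1 - P(X <= 4); expected count in the sample is n*M/N = 6
p_at_least_5 = 1 - sum(hyper_pmf(k) for k in range(5))
print(round(p_at_least_5, 4))
```

With an expected count of 6 flagged invoices in the sample, seeing at least 5 is the likely outcome.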

Multinomial — K categories

When to use: n independent trials, each with K ≥ 3 possible outcomes with fixed probabilities p₁, p₂, …, p_K.

\[P(X_1 = k_1, \ldots, X_K = k_K) = \frac{n!}{k_1! \cdots k_K!}\, p_1^{k_1} \cdots p_K^{k_K}\]

where \(k_1 + k_2 + \cdots + k_K = n\) and \(p_1 + p_2 + \cdots + p_K = 1\).

Example: A SaaS support team classifies tickets into K = 4 categories: billing (30%), technical (45%), account (15%), other (10%). In a batch of n = 40 tickets, what is P(billing ≥ 15, technical ≥ 20)?

Each margin is Binomial: \(X_j \sim \text{Binomial}(n, p_j)\).

Binomial is the special case K = 2.
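The joint tail in the ticket example can be computed exactly by lumping the last two categories (merged multinomial categories are again multinomial, so (X₁, X₂, n−X₁−X₂) is trinomial); a sketch:

```python
from math import factorial

# Ticket example: billing 30%, technical 45%, remaining categories 25% combined
n = 40
p1, p2, p_rest = 0.30, 0.45, 0.25

def trinom_pmf(k1, k2):
    k3 = n - k1 - k2
    coef = factorial(n) // (factorial(k1) * factorial(k2) * factorial(k3))
    return coef * p1**k1 * p2**k2 * p_rest**k3

# Sanity check: the PMF sums to 1 over all valid (k1, k2)
total = sum(trinom_pmf(k1, k2) for k1 in range(n + 1) for k2 in range(n - k1 + 1))

# P(billing >= 15, technical >= 20)
p_joint = sum(trinom_pmf(k1, k2)
              for k1 in range(15, n + 1)
              for k2 in range(20, n - k1 + 1))
print(round(total, 6), round(p_joint, 4))
```

Note the joint probability is well below the product of the two Binomial margins: with n fixed, the category counts are negatively correlated.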

Story: Poisson — SOC alerts (λ = 6.2)

Setting: A security operations center (SOC) receives an average of λ = 6.2 high-priority alerts per hour. Operations policy requires immediate escalation when 10 or more alerts arrive in one hour.

Question: What is the probability of triggering escalation in any given hour?

Why Poisson?

  • No fixed number of trials (alerts can arrive at any moment)
  • Events occur independently at a constant average rate λ
  • Two alerts cannot arrive at exactly the same instant

\[X \sim \text{Poisson}(\lambda = 6.2)\]

Handbook result: P(X ≥ 10) = 1 − P(X ≤ 9) ≈ 0.0984
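The escalation probability follows directly from the PMF; a short check:

```python
from math import exp, factorial

lam = 6.2  # average high-priority alerts per hour

def pois_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# Escalation: P(X >= 10) = 1 - P(X <= 9)
p_escalate = 1 - sum(pois_pmf(k) for k in range(10))
print(round(p_escalate, 4))  # about 0.098 -- roughly one hour in ten
```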

Poisson — annotated PMF

\[P(X = k) = \frac{\lambda^k\, e^{-\lambda}}{k!}\]

Symbol Meaning
λ average rate (events per time/space window)
k observed count (the outcome we’re asking about)
e Euler’s number ≈ 2.718
k! k factorial — accounts for event ordering

Mean = λ     Variance = λ (equal — this is the Poisson diagnostic)

If your data has variance ≫ mean → consider Negative Binomial (overdispersion).

Poisson app

Task

  1. Set λ=6.2. Compute P(X≥10). What escalation rate does this imply?
  2. Increase λ to 20. Compare the shape to Binomial(n=200, p=0.1). What do you notice?
  3. Lower λ to 0.5. What is the most likely count? Why does this make sense?
  4. For each value of λ, verify: does mean ≈ variance in the output?

From discrete to continuous

Discrete distributions assign probability to individual points: P(X = k).

Continuous distributions assign probability to intervals: P(a ≤ X ≤ b) = area under f(x).

The shift happens when:

  • Outcomes are measurements, not counts
  • The possible values form a continuum (every real number in a range)

Key rule: For continuous distributions, P(X = x) = 0 for any single value x.

Probability lives in area, not in height.

Uniform — equal probability

When to use: every value in the interval [a, b] is equally likely.

\[f(x) = \frac{1}{b-a} \quad\quad a \leq x \leq b\]

Symbol Meaning
a lower bound
b upper bound
1/(b−a) constant density (height of the rectangle)

Mean = (a+b)/2     Variance = (b−a)²/12

Applications:

  • Random number generation (computers generate U(0,1) first, then transform)
  • Models for “I have no information about which value in [a, b] is more likely”
  • Rounding errors: if a value is rounded to the nearest integer, the rounding error ~ U(−0.5, 0.5)

P(c ≤ X ≤ d) = (d − c)/(b − a) — just the fraction of the interval.
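Both the interval-fraction rule and the rounding-error claim can be checked by simulation; a sketch with an arbitrary illustrative interval [2, 10]:

```python
import random

random.seed(7)  # fixed seed for reproducibility

a, b = 2.0, 10.0  # illustrative interval

# P(c <= X <= d) = (d - c)/(b - a): check by simulation
c, d = 3.0, 5.0
draws = [random.uniform(a, b) for _ in range(100_000)]
frac = sum(c <= x <= d for x in draws) / len(draws)
print(round(frac, 3), (d - c) / (b - a))  # both near 0.25

# Rounding error of round-to-nearest-integer stays within (-0.5, 0.5)
errs = [x - round(x) for x in draws]
print(round(min(errs), 2), round(max(errs), 2))
```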

Story: Normal — sum of many small effects

Why does the Normal distribution appear everywhere?

Human height is the result of hundreds of genetic and environmental factors, each contributing a tiny amount. When many independent small effects add up, the sum follows a Normal distribution — regardless of the original distributions of the individual effects.

This is the Central Limit Theorem: the sample mean \(\bar{X}\) of n independent draws converges to Normal as n → ∞, no matter the original distribution.
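The CLT is easy to see in simulation: averages of uniform draws (nothing bell-shaped about a uniform!) cluster normally around 0.5. A sketch with illustrative sample sizes:

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Means of n = 30 U(0,1) draws; CLT predicts center 0.5
# and standard deviation sqrt(1/12)/sqrt(30) ~ 0.0527
n, reps = 30, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

grand_mean = sum(means) / reps
sd = (sum((m - grand_mean) ** 2 for m in means) / reps) ** 0.5
print(round(grand_mean, 3), round(sd, 4))

# Normal-shape check: about 95% of sample means fall within mean +/- 2 sd
within = sum(abs(m - 0.5) <= 2 * sd for m in means) / reps
print(round(within, 3))
```

The ~95% coverage at ±2 standard deviations is exactly the Normal signature the next slide quantifies.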

Common Normal applications:

  • Measurement errors (many tiny sources of error summing up)
  • Test scores (sum of many item responses)
  • Financial returns over a day (many trades contributing)
  • Biological measurements: height, weight, blood pressure

Normal — annotated PDF

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

Symbol Meaning
μ mean — center of the distribution
σ standard deviation — spread (σ > 0)
σ² variance
(x−μ)/σ standardized distance from center

Empirical rule (68-95-99.7):

  • 68% of values fall within μ ± σ
  • 95% of values fall within μ ± 2σ
  • 99.7% of values fall within μ ± 3σ

Standard Normal: μ = 0, σ = 1    →    denoted Z ~ N(0, 1)
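The 68-95-99.7 figures follow from the standard Normal CDF, Φ(x) = (1 + erf(x/√2))/2, which Python exposes via math.erf:

```python
from math import erf, sqrt

# P(|Z| <= k) for Z ~ N(0, 1) equals erf(k / sqrt(2))
def within(k):
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973  (the empirical rule)
```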

Normal app

Task

  1. Set μ=5, σ=2. Generate 100 values. Does the histogram look bell-shaped?
  2. Increase n to 1000. What changes? What stays approximately the same?
  3. Set μ=0, σ=1 (standard normal). What proportion falls between −1 and +1?
  4. Change σ to 0.5, then to 4. What changes in the shape? What stays the same?

t, Chi-squared, F — bridge to hypothesis testing

Three distributions you will encounter in hypothesis testing — all derived from the Normal:

Distribution Origin Used in
Student t(ν) standardized mean when σ is unknown one-sample, two-sample t-tests
Chi-squared χ²(ν) sum of ν squared standard Normals variance tests, goodness-of-fit, contingency tables
F(ν₁, ν₂) ratio of two independent χ²/ν ANOVA, regression

You do not choose these distributions — the test procedure selects them automatically based on what assumptions are met.

The t has heavier tails than the Normal because of the added uncertainty from estimating σ; χ² and F are right-skewed rather than bell-shaped. As ν → ∞, t(ν) → N(0,1) and χ²(ν)/ν → 1.

Key ideas — distribution selection table

Topic What to retain
Bernoulli one trial, binary; building block for Binomial
Binomial n fixed trials, count successes; mean = np; variance = np(1−p)
Geometric / Neg. Binomial count failures before r-th success; memoryless (Geometric)
Hypergeometric Binomial without replacement; use when n/N ≥ 0.1
Poisson events at rate λ; mean = variance = λ (diagnostic)
Multinomial K ≥ 3 categories; each margin is Binomial
Uniform equal probability on [a, b]; base for simulation
Normal sum of many factors; 68-95-99.7 rule; parameters μ, σ²
t, χ², F derived from Normal; chosen by the test, not by you

Exit problem (pairs, 5 min)

A factory produces chips. Defects occur at an average rate of 2.5 per hour. An inspector samples 20 chips from today’s batch of 400, which contains 30 defectives.

  1. Which distribution describes the hourly defect count? Name its parameter.
  2. Which distribution describes the inspector’s sample? Why is it NOT the same as (1)?
  3. A colleague says “use Binomial for (2).” When would this approximation be acceptable?