Introduction to Probability Distributions

What to Use When — and How

Session roadmap

Observe your data
        ↓
Is it discrete (counts) or continuous (measurements)?
        ↓                               ↓
Discrete: Bernoulli, Binomial,    Continuous: Uniform, Normal
          Geometric, Neg. Binomial,       (t, χ², F → hypothesis testing)
          Hypergeometric, Poisson,
          Multinomial

One thread: identify the process that generated the data, then choose the distribution.

Pre-reading check

Hands up: true or false?

  1. “Counting events in a fixed time window at a constant rate → use the Binomial distribution”    false

  2. “The Poisson distribution has the property that its mean equals its variance”    true

  3. “The Normal distribution extends to ±∞ and has no finite maximum”    true

What is a probability distribution?

A probability distribution is a complete description of all possible outcomes and their probabilities.

  • Discrete distributions assign probability to countable outcomes: P(X = k) for k = 0, 1, 2, …
  • Continuous distributions assign probability to intervals: P(a ≤ X ≤ b) = area under f(x)

Two things to always specify:

  1. Support — what values can X take?
  2. Parameters — which numbers control the shape?

Every distribution in today’s session answers: “what is the data-generating process?”

How to choose a distribution

When you observe… Distribution Parameters
One trial, binary outcome Bernoulli p
n fixed trials, count successes Binomial n, p
Count failures before 1st success Geometric p
Count failures before r-th success Negative Binomial r, p
Sample n items from N without replacement Hypergeometric N, M, n
Count events in fixed time/space at rate λ Poisson λ
Count items in K ≥ 3 categories Multinomial n, p₁…p_K
Equal probability over interval [a, b] Uniform a, b
Measurement or sum of many independent factors Normal μ, σ²

Bernoulli — the building block

The Bernoulli distribution is the simplest possible: one trial, two outcomes.

\[P(X = 1) = p \quad\quad P(X = 0) = 1 - p\]

  • X = 1 (“success”), X = 0 (“failure”)
  • Single parameter: p = P(success)
  • Mean = p; Variance = p(1−p)

Every Binomial trial is a Bernoulli trial. Binomial(n, p) = sum of n independent Bernoulli(p) variables.
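This identity is easy to check by simulation. A minimal sketch (standard library only; the sample size of 10,000 is an illustrative choice):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n, p = 45, 0.5
reps = 10_000

# Each sample: sum of n independent Bernoulli(p) draws -> one Binomial(n, p) value
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]

mean = sum(samples) / reps
var = sum((x - mean) ** 2 for x in samples) / reps

# Theory: mean = np = 22.5, variance = np(1-p) = 11.25
print(round(mean, 2), round(var, 2))
```

The simulated mean and variance land close to np and np(1−p), as the identity predicts.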

Story: Binomial — hospital births

Setting: A large maternity hospital records an average of 45 births per day. Historically, about 50% of births are recorded as male.

Question: On a given day with 45 births, what is the probability of at most 27 males?

Why Binomial?

  • Fixed number of trials: n = 45
  • Each birth independently classified as male or not: binary outcome
  • Constant probability: p = 0.5

\[X \sim \text{Binomial}(n = 45,\ p = 0.5)\]

Handbook result: P(X ≤ 27) ≈ 0.9324
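A quick sanity check of the handbook value, summing the PMF directly with Python's standard library:

```python
from math import comb

# Exact Binomial CDF: P(X <= 27) for X ~ Binomial(45, 0.5)
n, p = 45, 0.5
p_at_most_27 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(28))

print(round(p_at_most_27, 4))  # near 0.932
```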

Binomial — annotated PMF

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

Symbol Meaning
n number of trials
k number of successes (the outcome we’re asking about)
p P(success) per trial
(1−p) P(failure) per trial
\(\binom{n}{k}\) number of ways to arrange k successes in n trials

Mean = np     Variance = np(1−p)

Variance is maximal when p = 0.5 and approaches 0 as p → 0 or p → 1.

Binomial app

Task

  1. Set n=45, p=0.5. Compute P(X≤27). Does the result match the handbook?
  2. Increase n to 200, keep p=0.5. How does the shape change?
  3. Set p=0.05, n=100. What does the distribution look like? (Compare to Poisson with λ=5.)
  4. Predict: what happens to variance when p=0.5 vs p=0.1? Verify.
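Task 3 can also be checked numerically: with large n and small p, the Binomial(n, p) PMF is well approximated by Poisson(λ = np). A short sketch:

```python
from math import comb, exp, factorial

# Task 3: Binomial(n=100, p=0.05) vs Poisson(lambda = np = 5)
n, p = 100, 0.05
lam = n * p

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# The two PMFs nearly coincide across the relevant range of counts
max_diff = max(abs(binom_pmf(k) - pois_pmf(k)) for k in range(21))
print(round(max_diff, 4))
```

The largest pointwise gap between the two PMFs is below 0.01, which is why the shapes look identical in the app.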

Geometric and Negative Binomial

Geometric — count failures before the first success

\[P(X = k) = (1-p)^k \cdot p \quad\quad k = 0, 1, 2, \ldots\]

  • Example: a B2B sales rep closes deals with p = 0.18. How many failed calls before the first sale?
  • Mean = (1−p)/p; Variance = (1−p)/p²
  • Memoryless: past failures carry no information about future success
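Both the mean formula and memorylessness can be verified directly from the PMF; a sketch using the sales-rep example (p = 0.18):

```python
p = 0.18  # per-call probability of closing a deal

# P(k failed calls before the first sale)
def geom_pmf(k):
    return (1 - p) ** k * p

# Mean via a truncated sum; compare to the closed form (1-p)/p = 4.5556
mean = sum(k * geom_pmf(k) for k in range(2000))

# Memorylessness: P(X >= m + j | X >= m) = P(X >= j), since P(X >= k) = (1-p)^k
def surv(k):
    return (1 - p) ** k

cond = surv(10 + 3) / surv(10)  # given 10 failures already, P of 3 more
print(round(mean, 4), round(cond, 6), round(surv(3), 6))
```

The conditional probability equals the unconditional one: past failures carry no information.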

Negative Binomial — count failures before the r-th success

\[P(X = k) = \binom{k+r-1}{k}(1-p)^k p^r\]

  • Example: sales pipeline targeting r = 6 closed deals, p = 0.25. Geometric is the special case r = 1.
  • Used for overdispersed count data (variance > mean)
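Two properties worth confirming in code: setting r = 1 recovers the Geometric PMF, and the variance r(1−p)/p² always exceeds the mean r(1−p)/p. A sketch with the pipeline example (r = 6, p = 0.25):

```python
from math import comb

p, r = 0.25, 6  # sales pipeline example

def nb_pmf(k, r):
    return comb(k + r - 1, k) * (1 - p) ** k * p ** r

# r = 1 collapses the binomial coefficient to 1 -> Geometric PMF
geom_match = all(abs(nb_pmf(k, 1) - (1 - p) ** k * p) < 1e-12 for k in range(50))

mean = r * (1 - p) / p        # 18.0 failed attempts on average
var = r * (1 - p) / p ** 2    # 72.0 -> variance > mean (overdispersion)
print(geom_match, mean, var)
```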

Hypergeometric — sampling without replacement

When to use: sampling n items from a finite population of N, where M items have the property of interest, without replacement.

\[P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}\]

Symbol Meaning
N total population size
M number in population with the property
n sample size
k number in sample with the property

Example: An auditor samples n = 80 invoices from N = 1,200, of which M = 90 are flagged for errors. What is P(X ≥ 5)?

Key contrast with Binomial: Binomial assumes sampling with replacement (or infinite population). Use Hypergeometric when n/N ≥ 0.1.
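The auditor's tail probability can be computed exactly from the PMF; a sketch (standard library only):

```python
from math import comb

N, M, n = 1200, 90, 80  # auditor example: population, flagged invoices, sample

def hyper_pmf(k):
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# P(X >= 5) = 1 - P(X <= 4); expected count in the sample is n*M/N = 6
p_at_least_5 = 1 - sum(hyper_pmf(k) for k in range(5))
print(round(p_at_least_5, 4))
```

With an expected count of 6 flagged invoices in the sample, seeing at least 5 is the likely outcome.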

Multinomial — K categories

When to use: n independent trials, each with K ≥ 3 possible outcomes with fixed probabilities p₁, p₂, …, p_K.

\[P(X_1 = k_1, \ldots, X_K = k_K) = \frac{n!}{k_1! \cdots k_K!}\, p_1^{k_1} \cdots p_K^{k_K}\]

where \(k_1 + k_2 + \cdots + k_K = n\) and \(p_1 + p_2 + \cdots + p_K = 1\).

Example: A SaaS support team classifies tickets into K = 4 categories: billing (30%), technical (45%), account (15%), other (10%). In a batch of n = 40 tickets, what is P(billing ≥ 15, technical ≥ 20)?

Each margin is Binomial: \(X_j \sim \text{Binomial}(n, p_j)\).

Binomial is the special case K = 2.
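The joint tail in the ticket example can be computed exactly by lumping the last two categories (merged multinomial categories are again multinomial, so (X₁, X₂, n−X₁−X₂) is trinomial); a sketch:

```python
from math import factorial

# Ticket example: billing 30%, technical 45%, remaining categories 25% combined
n = 40
p1, p2, p_rest = 0.30, 0.45, 0.25

def trinom_pmf(k1, k2):
    k3 = n - k1 - k2
    coef = factorial(n) // (factorial(k1) * factorial(k2) * factorial(k3))
    return coef * p1**k1 * p2**k2 * p_rest**k3

# Sanity check: the PMF sums to 1 over all valid (k1, k2)
total = sum(trinom_pmf(k1, k2) for k1 in range(n + 1) for k2 in range(n - k1 + 1))

# P(billing >= 15, technical >= 20)
p_joint = sum(trinom_pmf(k1, k2)
              for k1 in range(15, n + 1)
              for k2 in range(20, n - k1 + 1))
print(round(total, 6), round(p_joint, 4))
```

Note the joint probability is well below the product of the two Binomial margins: with n fixed, the category counts are negatively correlated.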

Story: Poisson — SOC alerts (λ = 6.2)

Setting: A security operations center (SOC) receives an average of λ = 6.2 high-priority alerts per hour. Operations policy requires immediate escalation when 10 or more alerts arrive in one hour.

Question: What is the probability of triggering escalation in any given hour?

Why Poisson?

  • No fixed number of trials (alerts can arrive at any moment)
  • Events occur independently at a constant average rate λ
  • Two alerts cannot arrive at exactly the same instant

\[X \sim \text{Poisson}(\lambda = 6.2)\]

Handbook result: P(X ≥ 10) = 1 − P(X ≤ 9) ≈ 0.0984
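The escalation probability follows directly from the PMF; a short check:

```python
from math import exp, factorial

lam = 6.2  # average high-priority alerts per hour

def pois_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# Escalation: P(X >= 10) = 1 - P(X <= 9)
p_escalate = 1 - sum(pois_pmf(k) for k in range(10))
print(round(p_escalate, 4))  # about 0.098 -- roughly one hour in ten
```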

Poisson — annotated PMF

\[P(X = k) = \frac{\lambda^k\, e^{-\lambda}}{k!}\]

Symbol Meaning
λ average rate (events per time/space window)
k observed count (the outcome we’re asking about)
e Euler’s number ≈ 2.718
k! k factorial — accounts for event ordering

Mean = λ     Variance = λ (equal — this is the Poisson diagnostic)

If your data has variance ≫ mean → consider Negative Binomial (overdispersion).

Poisson app

Task

  1. Set λ=6.2. Compute P(X≥10). What escalation rate does this imply?
  2. Increase λ to 20. Compare the shape to Binomial(n=200, p=0.1). What do you notice?
  3. Lower λ to 0.5. What is the most likely count? Why does this make sense?
  4. For each value of λ, verify: does mean ≈ variance in the output?

From discrete to continuous

Discrete distributions assign probability to individual points: P(X = k).

Continuous distributions assign probability to intervals: P(a ≤ X ≤ b) = area under f(x).

The shift happens when:

  • Outcomes are measurements, not counts
  • The possible values form a continuum (every real number in a range)

Key rule: For continuous distributions, P(X = x) = 0 for any single value x.

Probability lives in area, not in height.

Uniform — equal probability

When to use: every value in the interval [a, b] is equally likely.

\[f(x) = \frac{1}{b-a} \quad\quad a \leq x \leq b\]

Symbol Meaning
a lower bound
b upper bound
1/(b−a) constant density (height of the rectangle)

Mean = (a+b)/2     Variance = (b−a)²/12

Applications:

  • Random number generation (computers generate U(0,1) first, then transform)
  • Models for “I have no information about which value in [a, b] is more likely”
  • Rounding errors: if a value is rounded to the nearest integer, the rounding error ~ U(−0.5, 0.5)

P(c ≤ X ≤ d) = (d − c)/(b − a) — just the fraction of the interval.
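Both the interval-fraction rule and the rounding-error claim can be checked by simulation; a sketch with an arbitrary illustrative interval [2, 10]:

```python
import random

random.seed(7)  # fixed seed for reproducibility

a, b = 2.0, 10.0  # illustrative interval

# P(c <= X <= d) = (d - c)/(b - a): check by simulation
c, d = 3.0, 5.0
draws = [random.uniform(a, b) for _ in range(100_000)]
frac = sum(c <= x <= d for x in draws) / len(draws)
print(round(frac, 3), (d - c) / (b - a))  # both near 0.25

# Rounding error of round-to-nearest-integer stays within (-0.5, 0.5)
errs = [x - round(x) for x in draws]
print(round(min(errs), 2), round(max(errs), 2))
```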

Story: Normal — sum of many small effects

Why does the Normal distribution appear everywhere?

Human height is the result of hundreds of genetic and environmental factors, each contributing a tiny amount. When many independent small effects add up, the sum follows a Normal distribution — regardless of the original distributions of the individual effects.

This is the Central Limit Theorem: the sample mean \(\bar{X}\) of n independent draws converges to Normal as n → ∞, no matter the original distribution.
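The CLT is easy to see in simulation: averages of uniform draws (nothing bell-shaped about a uniform!) cluster normally around 0.5. A sketch with illustrative sample sizes:

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Means of n = 30 U(0,1) draws; CLT predicts center 0.5
# and standard deviation sqrt(1/12)/sqrt(30) ~ 0.0527
n, reps = 30, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

grand_mean = sum(means) / reps
sd = (sum((m - grand_mean) ** 2 for m in means) / reps) ** 0.5
print(round(grand_mean, 3), round(sd, 4))

# Normal-shape check: about 95% of sample means fall within mean +/- 2 sd
within = sum(abs(m - 0.5) <= 2 * sd for m in means) / reps
print(round(within, 3))
```

The ~95% coverage at ±2 standard deviations is exactly the Normal signature the next slide quantifies.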

Common Normal applications:

  • Measurement errors (many tiny sources of error summing up)
  • Test scores (sum of many item responses)
  • Financial returns over a day (many trades contributing)
  • Biological measurements: height, weight, blood pressure

Normal — annotated PDF

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

Symbol Meaning
μ mean — center of the distribution
σ standard deviation — spread (σ > 0)
σ² variance
(x−μ)/σ standardized distance from center

Empirical rule (68-95-99.7):

  • 68% of values fall within μ ± σ
  • 95% of values fall within μ ± 2σ
  • 99.7% of values fall within μ ± 3σ

Standard Normal: μ = 0, σ = 1    →    denoted Z ~ N(0, 1)
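The 68-95-99.7 figures follow from the standard Normal CDF, Φ(x) = (1 + erf(x/√2))/2, which Python exposes via math.erf:

```python
from math import erf, sqrt

# P(|Z| <= k) for Z ~ N(0, 1) equals erf(k / sqrt(2))
def within(k):
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973  (the empirical rule)
```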

Normal app

Task

  1. Set μ=5, σ=2. Generate 100 values. Does the histogram look bell-shaped?
  2. Increase n to 1000. What changes? What stays approximately the same?
  3. Set μ=0, σ=1 (standard normal). What proportion falls between −1 and +1?
  4. Change σ to 0.5, then to 4. What changes in the shape? What stays the same?

t, Chi-squared, F — bridge to hypothesis testing

Three distributions you will encounter in hypothesis testing — all derived from the Normal:

Distribution Origin Used in
Student t(ν) standardized mean when σ is unknown one-sample, two-sample t-tests
Chi-squared χ²(ν) sum of ν squared standard Normals variance tests, goodness-of-fit, contingency tables
F(ν₁, ν₂) ratio of two independent χ²/ν ANOVA, regression

You do not choose these distributions — the test procedure selects them automatically based on what assumptions are met.

The t has heavier tails than the Normal because of the added uncertainty from estimating σ; χ² and F are right-skewed rather than bell-shaped. As ν → ∞, t(ν) → N(0,1) and χ²(ν)/ν → 1.

Key ideas — distribution selection table

Topic What to retain
Bernoulli one trial, binary; building block for Binomial
Binomial n fixed trials, count successes; mean = np; variance = np(1−p)
Geometric / Neg. Binomial count failures before r-th success; memoryless (Geometric)
Hypergeometric Binomial without replacement; use when n/N ≥ 0.1
Poisson events at rate λ; mean = variance = λ (diagnostic)
Multinomial K ≥ 3 categories; each margin is Binomial
Uniform equal probability on [a, b]; base for simulation
Normal sum of many factors; 68-95-99.7 rule; parameters μ, σ²
t, χ², F derived from Normal; chosen by the test, not by you

Exit problem (pairs, 5 min)

A factory produces chips. Defects occur at an average rate of 2.5 per hour. An inspector samples 20 chips from today’s batch of 400, which contains 30 defectives.

  1. Which distribution describes the hourly defect count? Name its parameter.
  2. Which distribution describes the inspector’s sample? Why is it NOT the same as (1)?
  3. A colleague says “use Binomial for (2).” When would this approximation be acceptable?