Introduction to Probability

Examples, Intuition, and Interactive Demos

Session roadmap

Probability language ──→ Bayes updating ──→ Test accuracy ──→ Simulation variability
    events + conditioning    priors + evidence   prevalence effects   simulation averages
           ╲                                      ╱
            ╲──── NBC: Bayes for classification ─╱

Five topics. One thread: how to reason consistently under uncertainty.

Pre-reading check

Hands up: true or false?

  1. P(not A) = 1 − P(A)    true

  2. “P(A | B) means: probability of B given A”    false

  3. “A test with 99% sensitivity means a positive result is 99% likely correct”    false

Warm-up: prior and posterior

Warm-up (5 min)

  1. Which number is your starting belief?
  2. Which numbers describe test quality?
  3. Which number changes after a positive result?
  4. Why can the updated probability still be low?

Probability language

What you must retain

  • Probabilities are numbers between 0 and 1
  • P(not A) = 1 − P(A) — complement rule
  • P(A | B) means: probability of A given B
  • Independence: knowing B does not change the probability of A

Jeffreys: axioms govern reasoning, not measurement

Jeffreys’ definition: probability is the degree of confidence that we may reasonably have in a proposition.

Frequentist definitions (von Mises, Fisher) claim probability is a frequency — measurement is built into the definition.

Jeffreys’ framework claims nothing about measurement — only about reasoning:

“Once you have accepted any probabilities, you must reason consistently with them.”

Bayes’ theorem follows as a consequence of consistent reasoning — not as an arbitrary formula.

Three rules you will use today

\[ \text{P}(\neg A) \;=\; 1 - \text{P}(A) \quad\quad \leftarrow \text{complement rule} \]

\[ \text{P}(A \mid B) \;=\; \frac{\text{P}(A \cap B)}{\text{P}(B)} \quad\quad \leftarrow \text{conditional probability ("given")} \]

\[ \text{P}(A \mid B) \;=\; \text{P}(A) \quad\quad \leftarrow \text{independence: B tells you nothing about A} \]

These three rules are sufficient for everything today.
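A quick way to check all three rules is to enumerate a small sample space. The dice example below is mine, not from the session; it verifies the complement rule, the conditional-probability definition, and one case of independence with exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely rolls of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) as an exact fraction over the 36 outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def A(w): return w[0] % 2 == 0      # event: first die is even
def B(w): return w[0] + w[1] == 7   # event: the sum is seven

# Complement rule: P(not A) = 1 - P(A)
assert prob(lambda w: not A(w)) == 1 - prob(A)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_A_given_B = prob(lambda w: A(w) and B(w)) / prob(B)

# Independence: knowing the sum is 7 tells you nothing about parity
assert p_A_given_B == prob(A)       # both equal 1/2
print(p_A_given_B)                  # 1/2
```

The same three lines of reasoning carry every calculation in the rest of the session.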

Story: two sacks of coins

There are two sacks of coins.

  • Sack 1: 150 gold, 50 silver   (75% gold)
  • Sack 2: 100 gold, 200 silver   (33% gold)

A blindfolded person picks one sack at random and draws one coin. You observe a gold coin.

Question: which sack is now more plausible?

Predict the direction first

Say this out loud before touching the app:

  • “Gold is more common in Sack 1.”
  • “So observing gold should increase my belief in Sack 1.”

Coin Sacks App

Task

  1. Start with equal priors. Which sack is favored after seeing gold?
  2. Verify: P(Sack 1 | Gold) ≈ 0.692.
  3. Change the prior so Sack 2 is preferred 2:1. Does the conclusion change?
  4. Make the two sacks nearly identical (same coin mix). What happens to the update?

Explain the update in words before reading numbers.
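If you want to verify step 2 away from the app, the update is one application of Bayes' theorem. A minimal sketch (the function name `posterior_sack1` is mine, not from the app):

```python
def posterior_sack1(prior1, p_gold1, p_gold2):
    """P(Sack 1 | Gold) via Bayes' theorem with two hypotheses."""
    evidence = prior1 * p_gold1 + (1 - prior1) * p_gold2
    return prior1 * p_gold1 / evidence

# Equal priors; Sack 1 is 150/200 gold, Sack 2 is 100/300 gold.
print(round(posterior_sack1(0.5, 150 / 200, 100 / 300), 3))   # 0.692

# Step 3: prior 2:1 in favor of Sack 2, i.e. P(Sack 1) = 1/3.
print(round(posterior_sack1(1 / 3, 150 / 200, 100 / 300), 3)) # 0.529
```

Note that even a prior against Sack 1 is overcome by the gold observation, but only partially: the posterior direction is the same, the magnitude is not.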

Bayes’ theorem: annotated

\[ \underbrace{\text{P}(A \mid B)}_{\text{posterior}} \;=\; \frac{ \underbrace{\text{P}(B \mid A)}_{\text{likelihood}} \;\times\; \underbrace{\text{P}(A)}_{\text{prior}} }{ \underbrace{\text{P}(B)}_{\text{evidence}} } \]

Story: screening for a rare condition

A financial fraud detection system has:

  • Prevalence of fraud: 0.2% (2 in every 1,000 transactions)
  • Sensitivity: 99% (correctly flags 99% of real fraud)
  • Specificity: 99% (correctly clears 99% of legitimate transactions)

Question: if the system flags a transaction as fraudulent, how likely is it actually fraud?

Positive Test App

Task

  1. Set prevalence = 0.2%, sensitivity = 99%, specificity = 99%.
  2. Record the result. Compare to your prediction from before.
  3. Increase prevalence to 2%. What changes?
  4. Reset. Increase specificity to 99.9%. What changes?
  5. Write one sentence starting: “A flagged transaction means…”

What to say about a positive result

Use language like this:

  • “The test is good, but fraud is rare.”
  • “A flagged transaction increases the probability, but it may still be far from certain.”
  • “To know how likely it is after a positive result, I need prevalence and test quality.”

Avoid saying:

  • “The test is 99% accurate, so a flagged transaction is 99% likely fraud.”

The fraction of true positives among all positives (PPV) depends critically on prevalence.
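That dependence is easy to check numerically. A sketch of the PPV calculation using the task's own numbers (the helper `ppv` is mine, not part of the app):

```python
def ppv(prevalence, sensitivity, specificity):
    """P(fraud | flagged): true positives over all positives."""
    tp = prevalence * sensitivity            # true-positive rate overall
    fp = (1 - prevalence) * (1 - specificity)  # false-positive rate overall
    return tp / (tp + fp)

print(round(ppv(0.002, 0.99, 0.99), 3))    # 0.166: most flags are false alarms
print(round(ppv(0.02, 0.99, 0.99), 3))     # higher prevalence raises PPV
print(round(ppv(0.002, 0.99, 0.999), 3))   # higher specificity raises PPV
```

With 99% sensitivity and 99% specificity at 0.2% prevalence, only about 1 flagged transaction in 6 is real fraud.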

Story: classifying a car’s origin

Cars93 dataset: 93 cars. Predict origin (USA / non-USA) from one feature: Man.trans.avail (Yes / No).

  Man.trans.avail   P(feature | USA)   P(feature | non-USA)
  No                0.542              0.133
  Yes               0.458              0.867

Prior: P(USA) = 0.516. For a new car with Man.trans.avail = No:

  • USA score \(\;\propto\;\) 0.516 × 0.542 = 0.280
  • non-USA score \(\;\propto\;\) 0.484 × 0.133 = 0.064

Predicted: USA (P ≈ 81%).
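The hand computation above can be replayed in a few lines to confirm the ≈81% posterior:

```python
# Priors and likelihoods from the Cars93 table above.
prior_usa, prior_non = 0.516, 0.484
lik_no_usa, lik_no_non = 0.542, 0.133   # P(Man.trans.avail = No | class)

score_usa = prior_usa * lik_no_usa      # ~0.280
score_non = prior_non * lik_no_non      # ~0.064

# Normalize the scores so they sum to 1 (this plays the role of P(B)).
p_usa = score_usa / (score_usa + score_non)
print(round(p_usa, 2))                  # 0.81 -> predict USA
```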

Classifier App — Cars93

Task — select Origin + Man.trans.avail

  1. Read the prior probabilities from the output. What do they represent?
  2. Using the likelihood table, verify the prediction for a car with Man.trans.avail = No. Compute the scores by hand.
  3. Set training set to 70%. Find sensitivity and specificity in the output. What do these numbers tell you about the classifier?
  4. Use Shuffle Data several times. What changes: priors, sensitivity, specificity? Why? What does this variability remind you of from earlier today?

What is Naive Bayes?

One sentence:

Naive Bayes is Bayes’ theorem applied once per feature, assuming features are independent given the class.
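As a sketch of that sentence: one prior per class, one likelihood factor per feature, multiplied as if the features were independent. The Cars93 priors are reused below, but the second feature and all likelihood values are invented for illustration:

```python
# Toy Naive Bayes: "Bayes once per feature" as a plain product.
# Likelihood numbers are illustrative, not estimated from Cars93.
priors = {"USA": 0.516, "non-USA": 0.484}
likelihoods = {
    "USA":     {"man_trans=No": 0.542, "airbags=Yes": 0.60},
    "non-USA": {"man_trans=No": 0.133, "airbags=Yes": 0.70},
}

def predict(features):
    """Posterior over classes, assuming feature independence given class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for f in features:          # the "naive" step: multiply per feature
            score *= likelihoods[cls][f]
        scores[cls] = score
    total = sum(scores.values())    # normalize, as P(B) does in Bayes
    return {cls: s / total for cls, s in scores.items()}

print(predict(["man_trans=No", "airbags=Yes"]))
```

Adding a feature adds one factor per class; nothing else in the algorithm changes.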

Story: which hospital shows more extreme days?

Two hospitals record the proportion of boys born each day:

  • Large hospital: ~45 births per day
  • Small hospital: ~15 births per day

Question: which hospital more often records days with more than 60% boys?

\[ \bar{X}_n \;\xrightarrow{\;n \to \infty\;}\; \mu \quad\quad \text{(Law of Large Numbers)} \]

Predict before simulating.

Hospital Simulation App

Task

  1. Predict first: large or small hospital — which shows more extreme days?
  2. Run the simulation. Was your prediction correct?
  3. Increase the number of simulated days. What stabilizes?
  4. Change the threshold (60% → 80%). What changes?
  5. Explain using the words variability and sample size.
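The simulation's long-run answer can also be computed exactly with the binomial distribution, assuming each birth is a boy with probability 0.5 (a modelling assumption, not a fact from the data):

```python
from math import comb

def p_extreme(n_births, threshold=0.6, p_boy=0.5):
    """P(proportion of boys > threshold) on one day with n_births births."""
    cutoff = int(n_births * threshold)   # need strictly more boys than this
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy)**(n_births - k)
        for k in range(cutoff + 1, n_births + 1)
    )

small, large = p_extreme(15), p_extreme(45)
print(round(small, 3), round(large, 3))
assert small > large   # the small hospital sees extreme days more often
```

Larger samples fluctuate less around the mean, so the large hospital crosses the 60% line far less often; this is the Law of Large Numbers at work.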

Key ideas

  Topic                       What to retain
  Probability language        P(A|B) is conditional; independence means B tells you nothing about A
  Jeffreys                    Axioms govern reasoning, not measurement; Bayes follows from consistency
  Bayes’ theorem              Posterior ∝ likelihood × prior; P(B) normalizes
  Sensitivity / specificity   Test accuracy ≠ PPV; prevalence dominates at low base rates
  Naive Bayes                 Bayes per feature, independence assumed; “naive” is intentional
  Law of Large Numbers        Sample means converge to μ; small samples fluctuate more

Exit problem (pairs, 5 min)

A rapid test for a rare infection has:

  • Prevalence: 0.5%
  • Sensitivity: 98%
  • Specificity: 95%

You test 1,000 people.

  1. How many false positives do you expect?
  2. What is the probability a positive result is a true infection (PPV)?
  3. If you repeated this tomorrow with 1,000 new people, would the numbers be identical? Why not?