Descriptive Statistics & EDA — Lecture 1

Describing One Variable

Session roadmap

Data type ──→ Choose tool ──→ Describe distribution ──→ Test normality ──→ Detect problems
 scales + data type   tables + plots   center + spread   normality checks   anomalies/fraud
                     ╲                                           ╱
                      ╲────── same tools diagnose model residuals ──────╱

One question drives the lecture: “What does one variable look like — and can you trust the data?”

Pre-reading check

Hands up: true or false?

  1. “The mean is always the best measure of central tendency.”    false

  2. “A QQ plot plots data values against theoretical quantiles — points on the diagonal line mean the assumed distribution fits well.”    true

  3. “Benford’s Law states that in naturally-occurring data, the digit 1 appears as the leading digit about 30% of the time.”    true

Types of data and measurement scales

| Scale | Ordered? | Equal intervals? | True zero? | Example |
|---|---|---|---|---|
| Nominal | ✗ | ✗ | ✗ | Blood type: A, B, AB, O |
| Ordinal | ✓ | ✗ | ✗ | Pain: 1 = mild, 5 = severe |
| Interval | ✓ | ✓ | ✗ | Temperature in °C |
| Ratio | ✓ | ✓ | ✓ | Income in £, height in cm |

Why it matters: the scale constrains which summary statistics and plots are valid.

  • Nominal → mode, bar chart, contingency table
  • Ordinal → median, boxplot (use with care)
  • Interval/Ratio → mean, SD, histogram, all tests

Selection framework: which tool for which data?

| Goal | Nominal | Ordinal | Continuous (ratio/interval) |
|---|---|---|---|
| Display distribution | Bar chart | Bar chart or boxplot | Histogram, KDE, stem-and-leaf |
| Summarise centre | Mode | Median | Mean (symmetric) or median (skewed) |
| Summarise spread | — | IQR | SD (symmetric) or IQR/MAD (skewed) |
| Two variables | Contingency table | Rank correlation | Scatterplot, Pearson r |
| Normality check | — | — | QQ plot, skewness/kurtosis tests |
| Detect data problems | — | — | Benford plot, terminal digit analysis |

Decision rule: identify the scale first, then select the tool from this table.

Categorical data: frequency table and bar chart

Frequency table counts observations in each category.

| Category | Count | Relative freq | Cumulative freq |
|---|---|---|---|
| Blood type A | 42 | 42% | 42% |
| Blood type O | 35 | 35% | 77% |
| Blood type B | 15 | 15% | 92% |
| Blood type AB | 8 | 8% | 100% |

Bar chart: bar heights are frequencies (or proportions). Gaps between bars signal nominal data.

Histogram: bar areas are frequencies; no gaps; requires interval/ratio data.

Confusing these two is one of the most common errors in student reports.
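The counts, relative and cumulative frequencies above can be computed in a few lines. A minimal Python sketch; the sample is hypothetical, built to match the blood-type table:

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, relative, cumulative) rows, most frequent first."""
    counts = Counter(values)
    n = len(values)
    rows, cum = [], 0.0
    for cat, cnt in counts.most_common():
        rel = cnt / n
        cum += rel
        rows.append((cat, cnt, rel, cum))
    return rows

# Hypothetical sample matching the table above: 42 A, 35 O, 15 B, 8 AB.
sample = ["A"] * 42 + ["O"] * 35 + ["B"] * 15 + ["AB"] * 8
for cat, cnt, rel, cum in frequency_table(sample):
    print(f"{cat:>2}  {cnt:3d}  {rel:6.1%}  {cum:6.1%}")
```

The cumulative column of the last row should always reach 100% — a quick sanity check on the table.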

Contingency table: cross-tabulating two categorical variables

Cross-tabulation counts joint frequencies: rows = one variable, columns = another.

| | Survived | Died | Row total |
|---|---|---|---|
| First class | 203 | 122 | 325 |
| Second class | 118 | 167 | 285 |
| Third class | 178 | 528 | 706 |
| Column total | 499 | 817 | 1316 |

Conditional frequency (row %): of third-class passengers, 178/706 = 25.2% survived.

The chi-squared test assesses whether survival is independent of class. We return to this in the hypothesis testing unit.
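Row-conditional percentages fall straight out of the joint counts. A minimal Python sketch using the Titanic counts above:

```python
# Joint counts from the Titanic table: (survived, died) per class.
titanic = {
    "First class":  (203, 122),
    "Second class": (118, 167),
    "Third class":  (178, 528),
}

def row_percent(survived, died):
    """Conditional frequency of survival given the row (class)."""
    return survived / (survived + died)

for cls, (s, d) in titanic.items():
    print(f"{cls:>12}: {row_percent(s, d):.1%} survived")
```

For third class this reproduces the 178/706 = 25.2% quoted above.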

From contingency table to confusion matrix

A confusion matrix is a contingency table where rows = actual class, columns = predicted class.

| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP (True Positive) | FN (False Negative) |
| Actual: Negative | FP (False Positive) | TN (True Negative) |

Key derived metrics:

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \qquad \text{PPV} = \frac{TP}{TP + FP} \]

Accuracy = (TP + TN) / N — misleading when classes are imbalanced.
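The derived metrics are one-liners once the four counts are known. A minimal Python sketch; the counts are hypothetical, chosen to show how class imbalance makes accuracy look good while PPV collapses:

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # of actual positives, share caught
        "specificity": tn / (tn + fp),               # of actual negatives, share cleared
        "ppv":         tp / (tp + fp),               # of predicted positives, share correct
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical imbalanced screen: 10 actual positives among 1000 cases.
m = classification_metrics(tp=8, fn=2, fp=90, tn=900)
print(m)  # accuracy ~0.91, sensitivity 0.8 — but only ~8% of flagged cases are real
```

High accuracy here coexists with a PPV below 10% — exactly the imbalance trap the slide warns about.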

ROC curve: trading sensitivity against specificity

ROC = Receiver Operating Characteristic. Plots sensitivity (TPR) vs. 1 − specificity (FPR) as the classification threshold varies.

Sensitivity (TPR)
    1.0 ┤    ╭──────────────
        │   ╱
    0.5 ┤  ╱ ← Good classifier
        │ ╱
    0.0 ┤╱______________
        0.0    0.5    1.0
              1 − Specificity (FPR)

AUC (area under curve): 0.5 = random; 1.0 = perfect.

Usage: choose a threshold by fixing an acceptable FPR (e.g., “I will tolerate 5% false positives; what sensitivity do I get?”).
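Given (FPR, TPR) points from a threshold sweep, the AUC is the trapezoidal area under the curve. A minimal sketch; the two example curves are the textbook extremes (random and perfect), not real classifier output:

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A random classifier's ROC is the diagonal: AUC = 0.5.
print(auc([(0, 0), (0.5, 0.5), (1, 1)]))  # 0.5
# A perfect classifier jumps straight to (0, 1): AUC = 1.0.
print(auc([(0, 0), (0, 1), (1, 1)]))      # 1.0
```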

Story: assignment submission times

A university lecturer downloads submission timestamps for 847 essays due at midnight Friday.

Submissions per hour (Friday):
16:00 ████
17:00 ██████
18:00 ████████████   ← late-afternoon cluster
19:00 █████████████████
20:00 ██████
21:00 ████
22:00 ██████████████████████  ← last-minute cluster
23:00 █████████████████████████████████
23:59 █████████████████████████████

The histogram reveals a bimodal distribution. Two student types: early submitters and last-minute submitters.

Question: what does this mean for deadline policy?

Histogram: annotated

Histogram divides the range of a continuous variable into \(k\) equal-width bins. Bar height = frequency (or density).

\[ \text{Sturges' rule:} \quad k = 1 + \log_2 n \quad \text{(rounded to the nearest integer)} \]

| \(n\) | Sturges \(k\) |
|---|---|
| 50 | 7 |
| 200 | 9 |
| 1000 | 11 |
| 10000 | 14 |

Binning discards information: too few bins → oversmoothed, real structure hidden; too many bins → noisy, spurious structure from sampling variation.

Density histogram: \(y\)-axis = relative frequency / bin width. Area sums to 1 — comparable across datasets of different sizes.
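Sturges' rule in code, matching the table above (rounding to the nearest integer; some texts take the ceiling instead, which gives 15 rather than 14 at \(n = 10000\)):

```python
import math

def sturges_bins(n):
    """Sturges' rule k = 1 + log2(n), rounded to the nearest integer."""
    return round(1 + math.log2(n))

for n in (50, 200, 1000, 10000):
    print(f"n = {n:>5}: k = {sturges_bins(n)}")
```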

Histogram and frequency table app

Task — student survey submission times (\(n = 139\), times in seconds)

  1. What does the histogram shape suggest about this distribution? (Symmetric? Skewed? Bimodal?)
  2. Use the bin-width slider. Find the minimum bins that reveal any hidden structure.
  3. Switch to a density histogram. What changes on the y-axis? Why is this useful?
  4. Read the frequency table below the plot. How does the table relate to each bar?
  5. Describe the distribution in three words: shape, centre, spread.

Stem-and-leaf: preserving raw data without binning

Stem-and-leaf plot: each observation split into a stem (leading digit(s)) and leaf (next digit). Preserves every data value.

Exam scores (n = 19):
  5 | 2 8
  6 | 1 3 5 7 9
  7 | 0 2 4 6 8 8
  8 | 1 3 5 9
  9 | 0 4

Back-to-back stem-and-leaf: compare two groups on the same stem axis.

Advantages: exact values recoverable; visible gaps, clusters, outliers. Limitation: unwieldy for \(n > 200\).
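A stem-and-leaf display is simple to generate. A minimal Python sketch; the score list is read off the plot above:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Plain-text stem-and-leaf: stem = tens digit, leaf = units digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return "\n".join(
        f"{stem:3d} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    )

scores = [52, 58, 61, 63, 65, 67, 69, 70, 72, 74, 76, 78, 78,
          81, 83, 85, 89, 90, 94]
print(stem_and_leaf(scores))
```

Because every leaf is a raw digit, the original data values can be read back exactly — the property the slide highlights.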

Story: Benford’s Law — catching fraud with a frequency plot

Benford’s Law: in naturally-occurring numerical data, the leading digit \(d\) appears with frequency

\[ P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right) \]

| Leading digit | Expected % |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| ⋮ | ⋮ |
| 9 | 4.6% |

Applications: financial fraud detection, election results, COVID-19 reported cases.

Why it works: numbers that arise from multiplicative processes (prices, populations, measurements) naturally follow Benford’s distribution.
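Both sides of a Benford check — expected and observed leading-digit frequencies — are short to compute. A minimal Python sketch:

```python
import math
from collections import Counter

def benford_expected(d):
    """Benford probability of leading digit d: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def leading_digit_freqs(values):
    """Observed relative frequency of each leading digit 1-9."""
    digits = [int(next(c for c in str(abs(v)) if c.isdigit() and c != "0"))
              for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

for d in (1, 2, 3, 9):
    print(f"P(leading digit {d}) = {benford_expected(d):.1%}")  # 30.1%, 17.6%, 12.5%, 4.6%
```

Comparing `leading_digit_freqs` against `benford_expected` digit by digit is the core of the Benford plot in the forensics app below.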

Data quality forensics app

Task — heart disease data pre-loaded (bloodpressureNum)

  1. Examine the terminal digit distribution. Which digits appear far more often than expected? What is the expected count per digit?
  2. What is the most plausible explanation for this pattern? How might it affect a statistical analysis of blood pressure?
  3. Switch dataset to credit (top of sidebar). Select a financial variable (e.g., Credit amount). Do the leading digits follow Benford’s Law?
  4. Does the deviation from Benford look like random noise or a systematic pattern?

Central tendency: which mean to use?

Arithmetic mean: \(\bar{x} = \frac{1}{n}\sum x_i\) — minimises squared deviations; sensitive to outliers.

Median: middle value; minimises absolute deviations; robust to outliers and skewness.

Mode: most frequent value; the only valid summary for nominal data.

Geometric mean: \(\bar{x}_g = \left(\prod x_i\right)^{1/n}\) — appropriate for growth rates, ratios, log-normal data.

Harmonic mean: \(\bar{x}_h = n / \sum (1/x_i)\) — appropriate for rates (e.g., average speed over equal distances).

Decision rule: skewed or outlier-prone → median. Growth/ratio data → geometric mean. Rate data → harmonic mean.
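Python's standard `statistics` module (3.8+) covers all of these means. A quick sketch with hypothetical growth and speed figures:

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical growth factors for three years: +10%, +50%, -20%.
growth = [1.10, 1.50, 0.80]
print(f"arithmetic {mean(growth):.4f}")            # overstates compound growth
print(f"geometric  {geometric_mean(growth):.4f}")  # the true average factor

# Average speed over two equal-distance legs at 30 and 60 km/h.
print(f"harmonic   {harmonic_mean([30, 60]):.1f} km/h")  # 40.0, not 45
```

The geometric mean is the factor that, applied three times, reproduces the total growth — which is exactly why it is the right average for rates of change.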

Variability: SD vs. IQR vs. MAD

| Measure | Formula | Robust? | Use when |
|---|---|---|---|
| Variance \(s^2\) | \(\frac{1}{n-1}\sum(x_i - \bar{x})^2\) | ✗ | input to further computations |
| SD \(s\) | \(\sqrt{s^2}\) | ✗ | symmetric, approximately normal data |
| IQR | \(Q_3 - Q_1\) | ✓ | skewed data, outlier-prone data |
| MAD | \(\text{median}(\lvert x_i - \tilde{x}\rvert)\) | ✓ | heavy-tailed distributions |
| CV | \(s / \bar{x} \times 100\%\) | ✗ | comparing variability across different units |

Why three measures? SD assumes the mean is the right centre. IQR uses quantiles — valid without a mean. MAD uses the median — most robust.
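The robust measures are easy to compare on a small example. A minimal Python sketch with one deliberately gross outlier (hypothetical data):

```python
from statistics import median, quantiles, stdev

def mad(xs):
    """Median absolute deviation from the median."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def iqr(xs):
    """Interquartile range Q3 - Q1 (statistics.quantiles, default exclusive method)."""
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

data = [2, 3, 3, 4, 4, 5, 5, 6, 50]   # one gross outlier
print(f"SD  = {stdev(data):.1f}")      # inflated by the single outlier
print(f"IQR = {iqr(data):.1f}")
print(f"MAD = {mad(data):.1f}")
```

One wild value blows up the SD by an order of magnitude while the IQR and MAD barely move — the point of the "Robust?" column.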

Boxplot: five-number summary + outlier detection

Five-number summary: Min, Q1, Median, Q3, Max.

         ┌─────────┐
    ──┤  │         │  ├────── ●   ●
         └─────────┘
   Min   Q1  Med  Q3         Outliers
        ←── IQR ──→

Outlier rule (Tukey): an observation is a potential outlier if it falls beyond \(Q_1 - 1.5 \times \text{IQR}\) or \(Q_3 + 1.5 \times \text{IQR}\).

Notched boxplot: notch \(\approx \pm 1.58 \times IQR / \sqrt{n}\). If notches of two boxplots do not overlap, medians differ significantly (\(p < 0.05\), approx.).
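Tukey's rule translates directly into code. A minimal Python sketch using `statistics.quantiles` on hypothetical data:

```python
from statistics import quantiles

def tukey_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

print(tukey_outliers([2, 3, 3, 4, 4, 5, 5, 6, 50]))  # [50]
```

These are the points a boxplot draws as dots beyond the whiskers.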

Bridge: skewness, kurtosis, and the Cullen-Frey plot

Skewness measures asymmetry: \(\gamma_1 = \frac{m_3}{m_2^{3/2}}\)

  • \(\gamma_1 > 0\) → right-skewed (long right tail, mean > median)
  • \(\gamma_1 < 0\) → left-skewed (long left tail, mean < median)
  • \(|\gamma_1| > 1\) is practically significant

Excess kurtosis measures tail weight vs. Normal: \(\gamma_2 = \frac{m_4}{m_2^2} - 3\)

  • \(\gamma_2 > 0\) → heavier tails than Normal (leptokurtic)
  • \(\gamma_2 < 0\) → lighter tails (platykurtic)

Cullen-Frey plot: plots \((\gamma_1^2, \gamma_2)\) for your data against reference points for common distributions (Normal, Lognormal, Gamma, Beta, …). Guides distribution identification.
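Both shape statistics follow directly from the central moments. A minimal Python sketch (plain moment estimators, no small-sample bias correction):

```python
from statistics import mean

def shape(xs):
    """Sample skewness g1 = m3 / m2^(3/2) and excess kurtosis g2 = m4 / m2^2 - 3."""
    mu = mean(xs)
    m2 = mean((x - mu) ** 2 for x in xs)
    m3 = mean((x - mu) ** 3 for x in xs)
    m4 = mean((x - mu) ** 4 for x in xs)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

g1, g2 = shape([1, 1, 2, 2, 3, 10])   # long right tail → positive skewness
print(f"skewness {g1:.2f}, excess kurtosis {g2:.2f}")
```

Plotting \((g_1^2, g_2)\) for a sample is exactly the point the Cullen-Frey plot places against its reference distributions.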

Normality tests app: skewness, kurtosis, Cullen-Frey

Task — variable bwt (birth weight in grams) from the birthwt dataset, pre-loaded on the Skewness-Kurtosis Plot

  1. Read the skewness value. Is birth weight right- or left-skewed?
  2. Where does the sample point fall on the Cullen-Frey plot? Which distribution family is closest?
  3. Switch the test type (top of sidebar) to “Skewness and Kurtosis Test”. Read the skewness and kurtosis values and their standard errors. Are they far from zero?

QQ plot: the single best normality diagnostic

Quantile-Quantile plot: plots the sample quantiles (\(y\)-axis) against the theoretical quantiles of the reference distribution (\(x\)-axis).

Interpretation:

  • Points on the diagonal line → distribution fits
  • S-shaped curve → heavier or lighter tails than Normal
  • Systematic upward (convex) bow → right skewness
  • Systematic downward (concave) bow → left skewness
  • A single point far from the line → one extreme observation

Key property: works for any reference distribution (Normal, Exponential, Gamma, …). Most commonly used to assess normality.

Normal probability plot is equivalent; orientation may differ by software.

See the fitdistrnorm interactive plot in the handbook chapter qqplot.
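The point pairs a normal QQ plot draws can be computed directly with `statistics.NormalDist`. A minimal sketch, using the common plotting positions \((i + 0.5)/n\) (conventions vary by software):

```python
from statistics import NormalDist

def qq_points(sample):
    """(theoretical, observed) quantile pairs against the standard Normal,
    using plotting position (i + 0.5) / n for the i-th order statistic."""
    xs = sorted(sample)
    n = len(xs)
    z = NormalDist()   # standard Normal, mean 0, sd 1
    return [(z.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

for theo, obs in qq_points([3, 1, 2, 5, 4]):
    print(f"{theo:+.3f}  {obs}")
```

If the sample is Normal, these pairs fall on a straight line whose slope and intercept estimate the SD and mean.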

Bridge: same EDA tools for residual diagnostics

Every EDA tool you learned today also diagnoses model residuals:

| EDA tool | Use on raw data | Use on residuals |
|---|---|---|
| Histogram | Describe shape of variable | Check if residuals are approximately Normal |
| Boxplot | Detect outliers in raw data | Detect extreme residuals (leverage/influence) |
| QQ plot | Test normality of variable | Required check: normality of residuals for valid t-tests |
| Scatterplot | Explore bivariate relationships | Residuals vs. fitted → detect non-linearity, heteroskedasticity |
| ACF plot | Detect serial structure in raw data | Critical check: independence of residuals in time series models |

The first four tools are from today. ACF is new — it is introduced here as a preview and covered in depth in Lecture 2.

Key ideas — selection table

| Question | Tool | What you read |
|---|---|---|
| What type is this variable? | Measurement scale classification | Nominal / Ordinal / Interval / Ratio |
| What is the distribution shape? | Histogram, stem-and-leaf, KDE | Symmetric / skewed / bimodal / outliers |
| What are typical values? | Mean, median, mode | Centre of distribution |
| How spread out? | SD, IQR, MAD | Concentration vs. dispersion |
| Are there extreme observations? | Boxplot (Tukey rule) | Whiskers, outlier dots |
| Is the distribution Normal? | QQ plot, Cullen-Frey plot | Points on diagonal / distribution family |
| Is the data trustworthy? | Benford plot, terminal digit analysis | Digit frequency deviations |
| Two categorical variables? | Contingency table | Row/column percentages |
| Classification quality? | Confusion matrix, ROC | Sensitivity, specificity, AUC |

Exit problem (pairs, 5 min)

A financial auditor receives a CSV of 8,000 invoice amounts from a supplier. She runs a Shapiro-Wilk normality test and finds \(p < 0.001\).

  1. Which graphical tool should she use first to understand the overall distribution of invoice amounts?

  2. Which tool would reveal whether invoice amounts follow Benford’s Law?

  3. The test returns \(p < 0.001\). Does this mean she has found fraud? What should she do next?

  4. She decides to log-transform the invoice amounts. Which plot would she use to verify that the transformed data is approximately Normal?

Use the handbook selection table. Then discuss with your partner.