Data type (scales + data type) ──→ Choose tool (tables + plots) ──→ Describe distribution (centre + spread) ──→ Test normality (normality checks) ──→ Detect problems (anomalies/fraud)
The same tools also diagnose model residuals.
One question drives the lecture: “What does one variable look like — and can you trust the data?”
Pre-reading check
Hands up: true or false?
“The mean is always the best measure of central tendency.” (False)
“A QQ plot plots data values against theoretical quantiles; points on the diagonal line mean the assumed distribution fits well.” (True)
“Benford’s Law states that in naturally occurring data, the digit 1 appears as the leading digit about 30% of the time.” (True)
Types of data and measurement scales
| Scale | Ordered? | Equal intervals? | True zero? | Example |
|---|---|---|---|---|
| Nominal | ✗ | ✗ | ✗ | Blood type: A, B, AB, O |
| Ordinal | ✓ | ✗ | ✗ | Pain: 1 = mild, 5 = severe |
| Interval | ✓ | ✓ | ✗ | Temperature in °C |
| Ratio | ✓ | ✓ | ✓ | Income in £, height in cm |
Why it matters: the scale constrains which summary statistics and plots are valid.
Nominal → mode, bar chart, contingency table
Ordinal → median, boxplot (use with care)
Interval/Ratio → mean, SD, histogram, all tests
Selection framework: which tool for which data?
| Goal | Nominal | Ordinal | Continuous (ratio/interval) |
|---|---|---|---|
| Display distribution | Bar chart | Bar chart or boxplot | Histogram, KDE, stem-and-leaf |
| Summarise centre | Mode | Median | Mean (symmetric) or median (skewed) |
| Summarise spread | n/a | IQR | SD (symmetric) or IQR/MAD (skewed) |
| Two variables | Contingency table | Rank correlation | Scatterplot, Pearson r |
| Normality check | n/a | n/a | QQ plot, skewness/kurtosis tests |
| Detect data problems | n/a | n/a | Benford plot, terminal digit analysis |
Decision rule: identify the scale first, then select the tool from this table.
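The decision rule amounts to a two-step lookup: classify the scale, then read off the tool. A minimal sketch in Python, assuming a hypothetical `recommend_tools` helper and a dictionary that transcribes three rows of the selection table (neither is part of any library):

```python
# Hypothetical lookup that transcribes part of the selection table; not a library API.
SELECTION_TABLE = {
    "display distribution": {
        "nominal": ["bar chart"],
        "ordinal": ["bar chart", "boxplot"],
        "continuous": ["histogram", "KDE", "stem-and-leaf"],
    },
    "summarise centre": {
        "nominal": ["mode"],
        "ordinal": ["median"],
        "continuous": ["mean (symmetric)", "median (skewed)"],
    },
    "summarise spread": {
        "nominal": [],
        "ordinal": ["IQR"],
        "continuous": ["SD (symmetric)", "IQR/MAD (skewed)"],
    },
}


def recommend_tools(scale: str, goal: str) -> list[str]:
    """Step 1: map the scale to a table column. Step 2: read off the tools."""
    column = "continuous" if scale in ("interval", "ratio") else scale
    return SELECTION_TABLE[goal][column]


print(recommend_tools("ratio", "summarise centre"))
print(recommend_tools("ordinal", "summarise spread"))
```

An empty list for a nominal variable's spread mirrors the blank cell in the table: no standard spread measure applies there.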
Categorical data: frequency table and bar chart
A frequency table counts the observations in each category.
| Category | Count | Relative freq | Cumulative freq |
|---|---|---|---|
| Blood type A | 42 | 42% | 42% |
| Blood type O | 35 | 35% | 77% |
| Blood type B | 15 | 15% | 92% |
| Blood type AB | 8 | 8% | 100% |
Bar chart: bar heights are frequencies (or proportions). Gaps between bars signal categorical data.
Histogram: bar areas are frequencies; there are no gaps between bars; requires interval/ratio data.
Confusing these two is one of the most common errors in student reports.
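The blood-type frequency table can be reproduced in a few lines. This is a sketch only: the `blood_types` list is an illustrative sample constructed to match the counts in the table, and the categories are sorted by descending count, as the table does.

```python
from collections import Counter

# Illustrative sample constructed to reproduce the counts in the table (n = 100).
blood_types = ["A"] * 42 + ["O"] * 35 + ["B"] * 15 + ["AB"] * 8

counts = Counter(blood_types)
n = sum(counts.values())

cumulative = 0
rows = []
for category, count in counts.most_common():  # sorted by descending count
    cumulative += count
    rows.append((category, count, count / n, cumulative / n))

for category, count, rel, cum in rows:
    print(f"{category:<3} {count:>3} {rel:>5.0%} {cum:>5.0%}")
```

For nominal data the cumulative column depends on the chosen category order, so it is a presentation aid rather than a property of the variable.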
Contingency table: cross-tabulating two categorical variables
Harmonic mean: \(\bar{x}_h = n / \sum (1/x_i)\), appropriate for rates (e.g., average speed over equal distances).
Decision rule: skewed or outlier-prone → median. Growth/ratio data → geometric mean. Rate data → harmonic mean.
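A short sketch of the decision rule using Python's `statistics` module. All numbers are made up for illustration: two speeds over equal distances, two growth factors, and a small skewed sample.

```python
import statistics

# Rate data: 40 km/h out and 60 km/h back over the same distance.
speeds = [40, 60]
print(statistics.mean(speeds))           # arithmetic mean overstates the average speed
print(statistics.harmonic_mean(speeds))  # n / sum(1 / x_i)

# Growth data: +10% then +20%, expressed as growth factors.
factors = [1.10, 1.20]
print(statistics.geometric_mean(factors))  # average growth factor per period

# Skewed data: one extreme value pulls the mean but not the median.
incomes = [21, 23, 25, 27, 30, 250]  # hypothetical values, in thousands
print(statistics.mean(incomes), statistics.median(incomes))
```

The harmonic mean gives the true average speed here because equal distances mean more time is spent at the slower speed.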
Variability: SD vs. IQR vs. MAD
| Measure | Formula | Robust? | Use when |
|---|---|---|---|
| Variance \(s^2\) | \(\frac{1}{n-1}\sum(x_i - \bar{x})^2\) | ✗ | input to further computations |
| SD \(s\) | \(\sqrt{s^2}\) | ✗ | symmetric, approximately normal data |
| IQR | \(Q_3 - Q_1\) | ✓ | skewed data, outlier-prone data |
| MAD | \(\text{median}(\lvert x_i - \tilde{x} \rvert)\) | ✓ | heavy-tailed distributions |
| CV | \(s / \bar{x} \times 100\%\) | ✗ | comparing variability across different units |
Why three measures? SD assumes the mean is the right centre. IQR uses quantiles — valid without a mean. MAD uses the median — most robust.
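The difference in robustness is easy to see on a small example. The sketch below assumes made-up data and one particular quartile convention (`method="inclusive"` in `statistics.quantiles`); other software may compute quartiles slightly differently. The helper names `iqr` and `mad` are illustrative.

```python
import statistics


def iqr(values):
    # Quartiles via statistics.quantiles; 'inclusive' is one of several
    # quartile conventions, so results can differ slightly by software.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def mad(values):
    # Median absolute deviation around the median (unscaled).
    centre = statistics.median(values)
    return statistics.median(abs(x - centre) for x in values)


clean = [10, 12, 13, 14, 15, 16, 18]
contaminated = clean[:-1] + [180]  # hypothetical recording error in the last value

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    print(name, round(statistics.stdev(data), 2), iqr(data), mad(data))
```

A single corrupted value inflates the SD many times over, while the IQR and MAD are unchanged for this sample.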
Boxplot: five-number summary + outlier detection
Five-number summary: Min, Q1, Median, Q3, Max.
          ┌───────┬───────┐
  ├───────┤       │       ├───────┤      ●   ●
          └───────┴───────┘
 Min      Q1     Med      Q3           Outliers
          ←───── IQR ─────→
Outlier rule (Tukey): an observation is a potential outlier if it falls below \(Q_1 - 1.5 \times \text{IQR}\) or above \(Q_3 + 1.5 \times \text{IQR}\).
Notched boxplot: the notch spans the median \(\pm 1.58 \times \text{IQR} / \sqrt{n}\). If the notches of two boxplots do not overlap, the medians differ significantly (approximately \(p < 0.05\)).
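Both rules are simple arithmetic on the quartiles. A minimal sketch, assuming a hypothetical `tukey_summary` helper, made-up data, and the `inclusive` quartile convention (plotting software may use a different one, which can shift the fences slightly):

```python
import math
import statistics


def tukey_summary(values):
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences
    outliers = [x for x in values if x < lower or x > upper]
    half_notch = 1.58 * iqr / math.sqrt(len(values))  # notch half-width
    return {
        "five_number": (min(values), q1, med, q3, max(values)),
        "fences": (lower, upper),
        "outliers": outliers,
        "notch": (med - half_notch, med + half_notch),
    }


weights = [2.9, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.8, 5.9]  # hypothetical data
summary = tukey_summary(weights)
for key, value in summary.items():
    print(key, value)
```

Points flagged by the fences are candidates for inspection, not automatic deletions.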
Bridge: skewness, kurtosis, and the Cullen-Frey plot
\(\gamma_2 > 0\) → heavier tails than Normal (leptokurtic)
\(\gamma_2 < 0\) → lighter tails (platykurtic)
Cullen-Frey plot: plots \((\gamma_1^2, \gamma_2)\) for your data against reference points for common distributions (Normal, Lognormal, Gamma, Beta, …). Guides distribution identification.
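The coordinates of the sample point can be computed directly. The sketch below assumes the simple moment-based definitions \(\gamma_1 = m_3 / m_2^{3/2}\) and \(\gamma_2 = m_4 / m_2^2 - 3\), where \(m_k\) is the \(k\)-th central moment; this is an assumption, since software may apply bias corrections or report raw rather than excess kurtosis. The data are made up.

```python
import statistics


def central_moment(values, k):
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    # Moment-based skewness: m3 / m2^(3/2), no bias correction.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    # Moment-based excess kurtosis: m4 / m2^2 - 3, so a Normal gives about 0.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 12]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    g1, g2 = skewness(data), excess_kurtosis(data)
    print(name, round(g1, 3), round(g2, 3), "point:", (round(g1 ** 2, 3), round(g2, 3)))
```

Squaring the skewness discards its sign, so the direction of skew has to be read from \(\gamma_1\) itself, not from the plot.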
Task — variable bwt (birth weight in grams) from the birthwt dataset, pre-loaded on the Skewness-Kurtosis Plot
Read the skewness value. Is birth weight right- or left-skewed?
Where does the sample point fall on the Cullen-Frey plot? Which distribution family is closest?
Switch the test type (top of sidebar) to “Skewness and Kurtosis Test”. Read the skewness and kurtosis values and their standard errors. Are they far from zero?
Quantile-Quantile plot: plots the sample quantiles (\(y\)-axis) against the theoretical quantiles of the reference distribution (\(x\)-axis).
Interpretation:
- Points on the diagonal line → distribution fits
- S-shaped curve → heavier or lighter tails than normal
- Systematic upward bow (convex curve, with sample quantiles on the \(y\)-axis) → right skewness
- Systematic downward bow (concave curve) → left skewness
- Single outlier point → one extreme observation
Key property: works for any reference distribution (Normal, Exponential, Gamma, …). Most commonly used to assess normality.
A normal probability plot is equivalent; the axis orientation may differ between software packages.
See the fitdistrnorm interactive plot in the handbook chapter qqplot.
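The points of a normal QQ plot can be computed without a plotting library. A minimal sketch, assuming a hypothetical `normal_qq_points` helper, the plotting positions \((i - 0.5)/n\) (one common convention among several), a reference Normal fitted by the sample mean and SD, and made-up data:

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(fmean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are an assumption; software varies.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


sample = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 6.2]  # hypothetical, one large value
points = normal_qq_points(sample)
for theo, obs in points:
    print(f"{theo:6.2f} {obs:6.2f} {obs - theo:+6.2f}")
```

Because the reference distribution uses the sample's own mean and SD, a good fit places the points near the line \(y = x\); the third column shows each point's vertical distance from that line.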
Bridge: same EDA tools for residual diagnostics
Every EDA tool you learned today also diagnoses model residuals:
| EDA tool | Use on raw data | Use on residuals |
|---|---|---|
| Histogram | Describe shape of variable | Check if residuals are approximately Normal |
| Boxplot | Detect outliers in raw data | Detect extreme residuals (leverage/influence) |
| QQ plot | Test normality of variable | Required check: normality of residuals for valid t-tests |
| Scatterplot | Explore bivariate relationships | Residuals vs. fitted → detect non-linearity, heteroskedasticity |
| ACF plot | Detect serial structure in raw data | Critical check: independence of residuals in time series models |
The first four tools are from today. ACF is new — it is introduced here as a preview and covered in depth in Lecture 2.
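To make the bridge concrete, the sketch below fits a straight line to deliberately curved, made-up data and inspects the residuals. The helpers `fit_line` and `lag1_autocorrelation` are illustrative, written out by hand and not taken from any library; the lag-1 value is only a one-number preview of what an ACF plot displays.

```python
import statistics


def fit_line(x, y):
    # Ordinary least squares for y = a + b * x, written out by hand.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def lag1_autocorrelation(values):
    # Sample autocorrelation at lag 1: the first bar of an ACF plot.
    m = statistics.fmean(values)
    num = sum((values[t] - m) * (values[t - 1] - m) for t in range(1, len(values)))
    return num / sum((v - m) ** 2 for v in values)


x = list(range(1, 11))
y = [xi ** 2 for xi in x]  # deliberately non-linear, so a straight line misfits

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print([round(r, 1) for r in residuals])           # positive, then negative, then positive
print(round(lag1_autocorrelation(residuals), 2))  # clearly above zero
```

The U-shaped residual pattern is what a residuals vs. fitted scatterplot would show for a missed curvature, and the positive lag-1 autocorrelation reflects the same systematic structure.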
Key ideas — selection table
| Question | Tool | What you read |
|---|---|---|
| What type is this variable? | Measurement scale classification | Nominal / Ordinal / Interval / Ratio |
| What is the distribution shape? | Histogram, stem-and-leaf, KDE | Symmetric / skewed / bimodal / outliers |
| What are typical values? | Mean, median, mode | Centre of distribution |
| How spread out? | SD, IQR, MAD | Concentration vs. dispersion |
| Are there extreme observations? | Boxplot (Tukey rule) | Whiskers, outlier dots |
| Is the distribution Normal? | QQ plot, Cullen-Frey plot | Points on diagonal / distribution family |
| Is the data trustworthy? | Benford plot, terminal digit analysis | Digit frequency deviations |
| Two categorical variables? | Contingency table | Row/column percentages |
| Classification quality? | Confusion matrix, ROC | Sensitivity, specificity, AUC |
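The "Is the data trustworthy?" row compares observed leading-digit frequencies with the Benford proportions \(\log_{10}(1 + 1/d)\). A minimal sketch, assuming hypothetical helper names and using powers of 2 as a stand-in dataset known to follow Benford's Law closely (not real invoice data):

```python
import math
from collections import Counter


def benford_expected():
    # P(leading digit = d) = log10(1 + 1/d) for d = 1..9
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(amount):
    # First significant digit of a non-zero number, via scientific notation.
    return int(f"{abs(amount):.15e}"[0])


def leading_digit_frequencies(amounts):
    digits = Counter(leading_digit(a) for a in amounts if a != 0)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}


amounts = [2 ** k for k in range(1, 101)]  # stand-in data: powers of 2
expected = benford_expected()
observed = leading_digit_frequencies(amounts)
for d in range(1, 10):
    print(d, f"{expected[d]:.3f}", f"{observed[d]:.3f}")
```

A Benford plot draws these two columns side by side; large deviations are a prompt for closer inspection rather than proof of manipulation.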
Exit problem (pairs, 5 min)
A financial auditor receives a CSV of 8,000 invoice amounts from a supplier. She runs a Shapiro-Wilk normality test and finds \(p < 0.001\).
Which graphical tool should she use first to understand the overall distribution of invoice amounts?
Which tool would reveal whether invoice amounts follow Benford’s Law?
The test returns \(p < 0.001\). Does this mean she has found fraud? What should she do next?
She decides to log-transform the invoice amounts. Which plot would she use to verify that the transformed data is approximately Normal?
Use the handbook selection table. Then discuss with your partner.