Descriptive Statistics & EDA — Lecture 1

Describing One Variable

Session roadmap

Data type ──→ Choose tool ──→ Describe distribution ──→ Test normality ──→ Detect problems
 scales + data type   tables + plots   center + spread   normality checks   anomalies/fraud
                     ╲                                           ╱
                      ╲────── same tools diagnose model residuals ──────╱

One question drives the lecture: “What does one variable look like — and can you trust the data?”

Pre-reading check

Hands up: true or false?

  1. “The mean is always the best measure of central tendency.”    false

  2. “A QQ plot plots data values against theoretical quantiles — points on the diagonal line mean the assumed distribution fits well.”    true

  3. “Benford’s Law states that in naturally-occurring data, the digit 1 appears as the leading digit about 30% of the time.”    true

Types of data and measurement scales

| Scale | Ordered? | Equal intervals? | True zero? | Example |
|---|---|---|---|---|
| Nominal | ✗ | ✗ | ✗ | Blood type: A, B, AB, O |
| Ordinal | ✓ | ✗ | ✗ | Pain: 1 = mild, 5 = severe |
| Interval | ✓ | ✓ | ✗ | Temperature in °C |
| Ratio | ✓ | ✓ | ✓ | Income in £, height in cm |

Why it matters: the scale constrains which summary statistics and plots are valid.

  • Nominal → mode, bar chart, contingency table
  • Ordinal → median, boxplot (use with care)
  • Interval/Ratio → mean, SD, histogram, all tests

Selection framework: which tool for which data?

| Goal | Nominal | Ordinal | Continuous (ratio/interval) |
|---|---|---|---|
| Display distribution | Bar chart | Bar chart or boxplot | Histogram, KDE, stem-and-leaf |
| Summarise centre | Mode | Median | Mean (symmetric) or median (skewed) |
| Summarise spread | — | IQR | SD (symmetric) or IQR/MAD (skewed) |
| Two variables | Contingency table | Rank correlation | Scatterplot, Pearson r |
| Normality check | — | — | QQ plot, skewness/kurtosis tests |
| Detect data problems | — | — | Benford plot, terminal digit analysis |

Decision rule: identify the scale first, then select the tool from this table.

Categorical data: frequency table and bar chart

Frequency table counts observations in each category.

| Category | Count | Relative freq | Cumulative freq |
|---|---|---|---|
| Blood type A | 42 | 42% | 42% |
| Blood type O | 35 | 35% | 77% |
| Blood type B | 15 | 15% | 92% |
| Blood type AB | 8 | 8% | 100% |

Bar chart: bar heights are frequencies (or proportions). Gaps between bars signal nominal data.

Histogram: bar areas are frequencies; no gaps; requires interval/ratio data.

Confusing these two is one of the most common errors in student reports.
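The counts, relative and cumulative frequencies above can be computed in a few lines. A minimal Python sketch; the sample is hypothetical, built to match the blood-type table:

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, relative, cumulative) rows, most frequent first."""
    counts = Counter(values)
    n = len(values)
    rows, cum = [], 0.0
    for cat, cnt in counts.most_common():
        rel = cnt / n
        cum += rel
        rows.append((cat, cnt, rel, cum))
    return rows

# Hypothetical sample matching the table above: 42 A, 35 O, 15 B, 8 AB.
sample = ["A"] * 42 + ["O"] * 35 + ["B"] * 15 + ["AB"] * 8
for cat, cnt, rel, cum in frequency_table(sample):
    print(f"{cat:>2}  {cnt:3d}  {rel:6.1%}  {cum:6.1%}")
```

The cumulative column of the last row should always reach 100% — a quick sanity check on the table.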

Contingency table: cross-tabulating two categorical variables

Cross-tabulation counts joint frequencies: rows = one variable, columns = another.

| | Survived | Died | Row total |
|---|---|---|---|
| First class | 203 | 122 | 325 |
| Second class | 118 | 167 | 285 |
| Third class | 178 | 528 | 706 |
| Column total | 499 | 817 | 1316 |

Conditional frequency (row %): of third-class passengers, 178/706 = 25.2% survived.

The chi-squared test assesses whether survival is independent of class. We return to this in the hypothesis testing unit.
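Row-conditional percentages fall straight out of the joint counts. A minimal Python sketch using the Titanic counts above:

```python
# Joint counts from the Titanic table: (survived, died) per class.
titanic = {
    "First class":  (203, 122),
    "Second class": (118, 167),
    "Third class":  (178, 528),
}

def row_percent(survived, died):
    """Conditional frequency of survival given the row (class)."""
    return survived / (survived + died)

for cls, (s, d) in titanic.items():
    print(f"{cls:>12}: {row_percent(s, d):.1%} survived")
```

For third class this reproduces the 178/706 = 25.2% quoted above.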

From contingency table to confusion matrix

A confusion matrix is a contingency table where rows = actual class, columns = predicted class.

| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP (True Positive) | FN (False Negative) |
| Actual: Negative | FP (False Positive) | TN (True Negative) |

Key derived metrics:

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \qquad \text{PPV} = \frac{TP}{TP + FP} \]

Accuracy = (TP + TN) / N — misleading when classes are imbalanced.
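The derived metrics are one-liners once the four counts are known. A minimal Python sketch; the counts are hypothetical, chosen to show how class imbalance makes accuracy look good while PPV collapses:

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # of actual positives, share caught
        "specificity": tn / (tn + fp),               # of actual negatives, share cleared
        "ppv":         tp / (tp + fp),               # of predicted positives, share correct
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical imbalanced screen: 10 actual positives among 1000 cases.
m = classification_metrics(tp=8, fn=2, fp=90, tn=900)
print(m)  # accuracy ~0.91, sensitivity 0.8 — but only ~8% of flagged cases are real
```

High accuracy here coexists with a PPV below 10% — exactly the imbalance trap the slide warns about.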

ROC curve: trading sensitivity against specificity

ROC = Receiver Operating Characteristic. Plots sensitivity (TPR) vs. 1 − specificity (FPR) as the classification threshold varies.

Sensitivity (TPR)
    1.0 ┤    ╭──────────────
        │   ╱
    0.5 ┤  ╱ ← Good classifier
        │ ╱
    0.0 ┤╱______________
        0.0    0.5    1.0
              1 − Specificity (FPR)

AUC (area under curve): 0.5 = random; 1.0 = perfect.

Usage: choose a threshold by fixing an acceptable FPR (e.g., “I will tolerate 5% false positives; what sensitivity do I get?”).
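Given (FPR, TPR) points from a threshold sweep, the AUC is the trapezoidal area under the curve. A minimal sketch; the two example curves are the textbook extremes (random and perfect), not real classifier output:

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A random classifier's ROC is the diagonal: AUC = 0.5.
print(auc([(0, 0), (0.5, 0.5), (1, 1)]))  # 0.5
# A perfect classifier jumps straight to (0, 1): AUC = 1.0.
print(auc([(0, 0), (0, 1), (1, 1)]))      # 1.0
```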

Story: assignment submission times

A university lecturer downloads submission timestamps for 847 essays due at midnight Friday.

Submissions per hour (Friday):
16:00 ████
17:00 ██████
18:00 ████████████   ← late-afternoon cluster
19:00 █████████████████
20:00 ██████
21:00 ████
22:00 ██████████████████████  ← last-minute cluster
23:00 █████████████████████████████████
23:59 █████████████████████████████

The histogram reveals a bimodal distribution. Two student types: early submitters and last-minute submitters.

Question: what does this mean for deadline policy?

Histogram: annotated

Histogram divides the range of a continuous variable into \(k\) equal-width bins. Bar height = frequency (or density).

\[ \text{Sturges' rule:} \quad k = 1 + \log_2 n \quad \text{(rounded to the nearest integer)} \]

| \(n\) | Sturges \(k\) |
|---|---|
| 50 | 7 |
| 200 | 9 |
| 1000 | 11 |
| 10000 | 14 |

Binning discards information: too few bins → oversmoothed, real structure hidden; too many bins → noisy, spurious structure from sampling variation.

Density histogram: \(y\)-axis = relative frequency / bin width. Area sums to 1 — comparable across datasets of different sizes.
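Sturges' rule in code, matching the table above (rounding to the nearest integer; some texts take the ceiling instead, which gives 15 rather than 14 at \(n = 10000\)):

```python
import math

def sturges_bins(n):
    """Sturges' rule k = 1 + log2(n), rounded to the nearest integer."""
    return round(1 + math.log2(n))

for n in (50, 200, 1000, 10000):
    print(f"n = {n:>5}: k = {sturges_bins(n)}")
```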

Histogram and frequency table app

Task — student survey submission times (\(n = 139\), times in seconds)

  1. What does the histogram shape suggest about this distribution? (Symmetric? Skewed? Bimodal?)
  2. Use the bin-width slider. Find the minimum bins that reveal any hidden structure.
  3. Switch to a density histogram. What changes on the y-axis? Why is this useful?
  4. Read the frequency table below the plot. How does the table relate to each bar?
  5. Describe the distribution in three words: shape, centre, spread.

Stem-and-leaf: preserving raw data without binning

Stem-and-leaf plot: each observation split into a stem (leading digit(s)) and leaf (next digit). Preserves every data value.

Exam scores (n = 19):
  5 | 2 8
  6 | 1 3 5 7 9
  7 | 0 2 4 6 8 8
  8 | 1 3 5 9
  9 | 0 4

Back-to-back stem-and-leaf: compare two groups on the same stem axis.

Advantages: exact values recoverable; visible gaps, clusters, outliers. Limitation: unwieldy for \(n > 200\).
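A stem-and-leaf display is simple to generate. A minimal Python sketch; the score list is read off the plot above:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Plain-text stem-and-leaf: stem = tens digit, leaf = units digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return "\n".join(
        f"{stem:3d} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    )

scores = [52, 58, 61, 63, 65, 67, 69, 70, 72, 74, 76, 78, 78,
          81, 83, 85, 89, 90, 94]
print(stem_and_leaf(scores))
```

Because every leaf is a raw digit, the original data values can be read back exactly — the property the slide highlights.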

Story: Benford’s Law — catching fraud with a frequency plot

Benford’s Law: in naturally-occurring numerical data, the leading digit \(d\) appears with frequency

\[ P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right) \]

| Leading digit | Expected % |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| ⋮ | ⋮ |
| 9 | 4.6% |

Applications: financial fraud detection, election results, COVID-19 reported cases.

Why it works: numbers that arise from multiplicative processes (prices, populations, measurements) naturally follow Benford’s distribution.
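Both sides of a Benford check — expected and observed leading-digit frequencies — are short to compute. A minimal Python sketch:

```python
import math
from collections import Counter

def benford_expected(d):
    """Benford probability of leading digit d: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def leading_digit_freqs(values):
    """Observed relative frequency of each leading digit 1-9."""
    digits = [int(next(c for c in str(abs(v)) if c.isdigit() and c != "0"))
              for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

for d in (1, 2, 3, 9):
    print(f"P(leading digit {d}) = {benford_expected(d):.1%}")  # 30.1%, 17.6%, 12.5%, 4.6%
```

Comparing `leading_digit_freqs` against `benford_expected` digit by digit is the core of the Benford plot in the forensics app below.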

Data quality forensics app

Task — heart disease data pre-loaded (bloodpressureNum)

  1. Examine the terminal digit distribution. Which digits appear far more often than expected? What is the expected count per digit?
  2. What is the most plausible explanation for this pattern? How might it affect a statistical analysis of blood pressure?
  3. Switch dataset to credit (top of sidebar). Select a financial variable (e.g., Credit amount). Do the leading digits follow Benford’s Law?
  4. Does the deviation from Benford look like random noise or a systematic pattern?

Central tendency: which mean to use?

Arithmetic mean: \(\bar{x} = \frac{1}{n}\sum x_i\) — minimises squared deviations; sensitive to outliers.

Median: middle value; minimises absolute deviations; robust to outliers and skewness.

Mode: most frequent value; the only valid summary for nominal data.

Geometric mean: \(\bar{x}_g = \left(\prod x_i\right)^{1/n}\) — appropriate for growth rates, ratios, log-normal data.

Harmonic mean: \(\bar{x}_h = n / \sum (1/x_i)\) — appropriate for rates (e.g., average speed over equal distances).

Decision rule: skewed or outlier-prone → median. Growth/ratio data → geometric mean. Rate data → harmonic mean.
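Python's standard `statistics` module (3.8+) covers all of these means. A quick sketch with hypothetical growth and speed figures:

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical growth factors for three years: +10%, +50%, -20%.
growth = [1.10, 1.50, 0.80]
print(f"arithmetic {mean(growth):.4f}")            # overstates compound growth
print(f"geometric  {geometric_mean(growth):.4f}")  # the true average factor

# Average speed over two equal-distance legs at 30 and 60 km/h.
print(f"harmonic   {harmonic_mean([30, 60]):.1f} km/h")  # 40.0, not 45
```

The geometric mean is the factor that, applied three times, reproduces the total growth — which is exactly why it is the right average for rates of change.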

Variability: SD vs. IQR vs. MAD

| Measure | Formula | Robust? | Use when |
|---|---|---|---|
| Variance \(s^2\) | \(\frac{1}{n-1}\sum(x_i - \bar{x})^2\) | ✗ | input to further computations |
| SD \(s\) | \(\sqrt{s^2}\) | ✗ | symmetric, approximately normal data |
| IQR | \(Q_3 - Q_1\) | ✓ | skewed data, outlier-prone data |
| MAD | \(\text{median}(\lvert x_i - \tilde{x}\rvert)\) | ✓ | heavy-tailed distributions |
| CV | \(s / \bar{x} \times 100\%\) | ✗ | comparing variability across different units |

Why three measures? SD assumes the mean is the right centre. IQR uses quantiles — valid without a mean. MAD uses the median — most robust.
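The robust measures are easy to compare on a small example. A minimal Python sketch with one deliberately gross outlier (hypothetical data):

```python
from statistics import median, quantiles, stdev

def mad(xs):
    """Median absolute deviation from the median."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def iqr(xs):
    """Interquartile range Q3 - Q1 (statistics.quantiles, default exclusive method)."""
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

data = [2, 3, 3, 4, 4, 5, 5, 6, 50]   # one gross outlier
print(f"SD  = {stdev(data):.1f}")      # inflated by the single outlier
print(f"IQR = {iqr(data):.1f}")
print(f"MAD = {mad(data):.1f}")
```

One wild value blows up the SD by an order of magnitude while the IQR and MAD barely move — the point of the "Robust?" column.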

Boxplot: five-number summary + outlier detection

Five-number summary: Min, Q1, Median, Q3, Max.

         ┌─────────┐
    ──┤  │         │  ├────── ●   ●
         └─────────┘
   Min   Q1  Med  Q3         Outliers
        ←── IQR ──→

Outlier rule (Tukey): an observation is a potential outlier if it falls beyond \(Q_1 - 1.5 \times \text{IQR}\) or \(Q_3 + 1.5 \times \text{IQR}\).

Notched boxplot: notch \(\approx \pm 1.58 \times IQR / \sqrt{n}\). If notches of two boxplots do not overlap, medians differ significantly (\(p < 0.05\), approx.).
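Tukey's rule translates directly into code. A minimal Python sketch using `statistics.quantiles` on hypothetical data:

```python
from statistics import quantiles

def tukey_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

print(tukey_outliers([2, 3, 3, 4, 4, 5, 5, 6, 50]))  # [50]
```

These are the points a boxplot draws as dots beyond the whiskers.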

Bridge: skewness, kurtosis, and the Cullen-Frey plot

Skewness measures asymmetry: \(\gamma_1 = \frac{m_3}{m_2^{3/2}}\)

  • \(\gamma_1 > 0\) → right-skewed (long right tail, mean > median)
  • \(\gamma_1 < 0\) → left-skewed (long left tail, mean < median)
  • \(|\gamma_1| > 1\) is practically significant

Excess kurtosis measures tail weight vs. Normal: \(\gamma_2 = \frac{m_4}{m_2^2} - 3\)

  • \(\gamma_2 > 0\) → heavier tails than Normal (leptokurtic)
  • \(\gamma_2 < 0\) → lighter tails (platykurtic)

Cullen-Frey plot: plots \((\gamma_1^2, \gamma_2)\) for your data against reference points for common distributions (Normal, Lognormal, Gamma, Beta, …). Guides distribution identification.
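Both shape statistics follow directly from the central moments. A minimal Python sketch (plain moment estimators, no small-sample bias correction):

```python
from statistics import mean

def shape(xs):
    """Sample skewness g1 = m3 / m2^(3/2) and excess kurtosis g2 = m4 / m2^2 - 3."""
    mu = mean(xs)
    m2 = mean((x - mu) ** 2 for x in xs)
    m3 = mean((x - mu) ** 3 for x in xs)
    m4 = mean((x - mu) ** 4 for x in xs)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

g1, g2 = shape([1, 1, 2, 2, 3, 10])   # long right tail → positive skewness
print(f"skewness {g1:.2f}, excess kurtosis {g2:.2f}")
```

Plotting \((g_1^2, g_2)\) for a sample is exactly the point the Cullen-Frey plot places against its reference distributions.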

Normality tests app: skewness, kurtosis, Cullen-Frey

Task — variable bwt (birth weight in grams) from the birthwt dataset, pre-loaded on the Skewness-Kurtosis Plot

  1. Read the skewness value. Is birth weight right- or left-skewed?
  2. Where does the sample point fall on the Cullen-Frey plot? Which distribution family is closest?
  3. Switch the test type (top of sidebar) to “Skewness and Kurtosis Test”. Read the skewness and kurtosis values and their standard errors. Are they far from zero?

QQ plot: the single best normality diagnostic

Quantile-Quantile plot: plots the sample quantiles (\(y\)-axis) against the theoretical quantiles of the reference distribution (\(x\)-axis).

Interpretation:

  • Points on the diagonal line → distribution fits
  • S-shaped curve → heavier or lighter tails than Normal
  • Systematic upward (convex) bow → right skewness
  • Systematic downward (concave) bow → left skewness
  • A single point far from the line → one extreme observation

Key property: works for any reference distribution (Normal, Exponential, Gamma, …). Most commonly used to assess normality.

Normal probability plot is equivalent; orientation may differ by software.

See the fitdistrnorm interactive plot in the handbook chapter qqplot.
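The point pairs a normal QQ plot draws can be computed directly with `statistics.NormalDist`. A minimal sketch, using the common plotting positions \((i + 0.5)/n\) (conventions vary by software):

```python
from statistics import NormalDist

def qq_points(sample):
    """(theoretical, observed) quantile pairs against the standard Normal,
    using plotting position (i + 0.5) / n for the i-th order statistic."""
    xs = sorted(sample)
    n = len(xs)
    z = NormalDist()   # standard Normal, mean 0, sd 1
    return [(z.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

for theo, obs in qq_points([3, 1, 2, 5, 4]):
    print(f"{theo:+.3f}  {obs}")
```

If the sample is Normal, these pairs fall on a straight line whose slope and intercept estimate the SD and mean.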

Bridge: same EDA tools for residual diagnostics

Every EDA tool you learned today also diagnoses model residuals:

| EDA tool | Use on raw data | Use on residuals |
|---|---|---|
| Histogram | Describe shape of variable | Check if residuals are approximately Normal |
| Boxplot | Detect outliers in raw data | Detect extreme residuals (leverage/influence) |
| QQ plot | Test normality of variable | Required check: normality of residuals for valid t-tests |
| Scatterplot | Explore bivariate relationships | Residuals vs. fitted → detect non-linearity, heteroskedasticity |
| ACF plot | Detect serial structure in raw data | Critical check: independence of residuals in time series models |

The first four tools are from today. ACF is new — it is introduced here as a preview and covered in depth in Lecture 2.

Key ideas — selection table

| Question | Tool | What you read |
|---|---|---|
| What type is this variable? | Measurement scale classification | Nominal / Ordinal / Interval / Ratio |
| What is the distribution shape? | Histogram, stem-and-leaf, KDE | Symmetric / skewed / bimodal / outliers |
| What are typical values? | Mean, median, mode | Centre of distribution |
| How spread out? | SD, IQR, MAD | Concentration vs. dispersion |
| Are there extreme observations? | Boxplot (Tukey rule) | Whiskers, outlier dots |
| Is the distribution Normal? | QQ plot, Cullen-Frey plot | Points on diagonal / distribution family |
| Is the data trustworthy? | Benford plot, terminal digit analysis | Digit frequency deviations |
| Two categorical variables? | Contingency table | Row/column percentages |
| Classification quality? | Confusion matrix, ROC | Sensitivity, specificity, AUC |

Exit problem (pairs, 5 min)

A financial auditor receives a CSV of 8,000 invoice amounts from a supplier. She runs a Shapiro-Wilk normality test and finds \(p < 0.001\).

  1. Which graphical tool should she use first to understand the overall distribution of invoice amounts?

  2. Which tool would reveal whether invoice amounts follow Benford’s Law?

  3. The test returns \(p < 0.001\). Does this mean she has found fraud? What should she do next?

  4. She decides to log-transform the invoice amounts. Which plot would she use to verify that the transformed data is approximately Normal?

Use the handbook selection table. Then discuss with your partner.