Descriptive Statistics & EDA — Lecture 2

Relationships, Association, and Model Validation

Session roadmap

Scatterplot ──→ Pearson r ──→ Spearman ──→ Concentration ──→ Residual reuse ──→ Time structure
 visual pattern   linear association   rank association   concentration   residual/time checks
                    ╲                                              ╱
                     ╲── same EDA tools validate model residuals ─╱

One question drives the lecture: “How do variables relate — and are your model assumptions met?”

Pre-reading check

Hands up: true or false?

  1. “Pearson correlation \(r = 0.9\) means 90% of the variance in \(Y\) is explained by \(X\).”    false

  2. “Spearman correlation should be used when the relationship between two variables is monotone but not necessarily linear.”    true

  3. “If residuals from a regression model show a pattern in the ACF plot, this is not a problem as long as the model fits well overall.”    false

From one variable to two: scatterplot as the starting point

Scatterplot: each observation is a point \((x_i, y_i)\). The most information-rich bivariate display.

What to look for:

  • Direction: positive or negative association?
  • Form: linear or non-linear (curved, step-shaped)?
  • Strength: tight cluster around a line, or a dispersed cloud?
  • Outliers: points far from the general pattern?

The scatterplot comes first. Do not compute a correlation coefficient before looking at the plot — a single number can miss all four questions above.

Story: coffee prices USA vs. Colombia

Monthly coffee commodity prices: USA market and Colombia market, 1990–2020.

USA price (USD/lb)
   4 ┤                                      ╭──
   3 ┤                               ╭─────╯
   2 ┤            ╭──────────────────╯
   1 ┤────────────╯
     └──────────────────────────────────────
         1990        2000        2010        2020

Colombia price — similar shape but different level and currency.

Hypothesis: two markets trading the same commodity should track together — but how tightly? What breaks the correlation?

Pearson \(r \approx 0.7\) — moderate positive. But is it linear? Are there periods of divergence?

Scatterplot + correlation formula — annotated

Pearson correlation coefficient:

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x\, s_y} = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y} \]

  • Range: \(-1 \leq r \leq 1\)
  • \(r = 0\) means no linear association (not “no association”)
  • \(r^2\) = coefficient of determination = proportion of variance in \(Y\) explained by \(X\)

t-test for \(r\): \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0: \rho = 0 \]
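Both formulas are easy to verify numerically. A minimal sketch (NumPy only; the data here are simulated purely for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r via the covariance formula: Cov(X, Y) / (s_x * s_y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2 under H0: rho = 0."""
    return r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Simulated example: y depends linearly on x plus noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(size=200)
r = pearson_r(x, y)
print(r, r**2, t_statistic(r, len(x)))
```

The same values come out of `np.corrcoef` or `scipy.stats.pearsonr`; writing it out once makes clear which pieces (covariance, the two standard deviations) drive the number.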

Scatterplot + Pearson + Spearman app

Task — coffee prices: Colombia vs. USA, pre-loaded

  1. Inspect the scatterplot. Does the relationship look linear? Are there periods of divergence?
  2. Record Pearson \(r\) and \(R^2\). What percentage of USA price variance is explained by Colombia price?
  3. Compare Pearson \(r\) to Spearman \(\rho\). Which is larger? What does the difference tell you about linearity?
  4. Is the spread of points constant across the price range, or does it fan out? What does this imply for a regression model?

Pearson: assumptions and when it fails

Pearson \(r\) is valid when:

  1. Both variables are continuous (interval or ratio scale)
  2. The relationship is approximately linear
  3. There are no extreme outliers (one outlier can shift \(r\) by ±0.2 in small samples)
  4. (For inference only) Both variables are approximately normally distributed

When Pearson fails — use alternatives:

  • Non-linear monotone relationship → Spearman \(\rho\)
  • Heavy tails / outliers → Spearman \(\rho\) or Kendall’s \(\tau\)
  • Controlling for a third variable → partial correlation
  • Categorical variable → Cramér’s V (nominal) or rank-biserial correlation

Rank correlation: Spearman vs. Kendall’s tau

Spearman \(\rho\): Replace each value with its rank, then compute Pearson on the ranks.

\[ \rho_S = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \quad \text{where } d_i = \text{rank}(x_i) - \text{rank}(y_i) \]

Kendall’s \(\tau\): Counts concordant pairs (\(C\)) minus discordant pairs (\(D\)):

\[ \tau = \frac{C - D}{\binom{n}{2}} \]

Property                               Spearman   Kendall
Sensitive to all monotone relations    yes        yes
Robustness to outliers in ranks        moderate   high
Interpretable as a probability         no         yes

\(\tau\) has a direct probabilistic interpretation: probability that a randomly chosen pair is concordant minus the probability it is discordant.
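Both rank coefficients follow directly from their definitions. A sketch (the rank-difference formula for \(\rho_S\) assumes no tied values, and the pairwise loop for \(\tau\) is the naive \(O(n^2)\) version):

```python
import numpy as np

def ranks(v):
    """Ranks 1..n (assumes no ties)."""
    return np.argsort(np.argsort(v)) + 1

def spearman_rho(x, y):
    """Spearman rho via the rank-difference formula (no ties)."""
    d = ranks(x) - ranks(y)
    n = len(x)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

def kendall_tau(x, y):
    """Kendall tau: (concordant pairs - discordant pairs) / C(n, 2)."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

# A perfectly monotone but non-linear relation: both coefficients equal 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**3
print(spearman_rho(x, y), kendall_tau(x, y))  # 1.0 1.0
```

Pearson \(r\) on the same data would be below 1, because \(y = x^3\) is monotone but not linear.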

Spurious correlations — the trap

Tyler Vigen’s database (tylervigen.com): Nicolas Cage film releases vs. swimming pool drownings — \(r \approx 0.67\), \(p < 0.05\).

Year:                2000  2002  2004  2006  2008  2010
Nicolas Cage films:     2     3     4     2     3     2
Pool drownings:       109   102   117   104   113   103

Why does this happen? Both time series trend together or share seasonal structure driven by a common third factor (e.g., summer activity levels, year-on-year economic growth).

Lesson: correlation is not causation. Before reading a statistically significant \(r\) causally, you need:

  • A plausible causal mechanism
  • Ruling out confounders
  • Temporal precedence (cause before effect)

Story: Concentration — who owns how much?

Global wealth distribution (Credit Suisse Wealth Report, 2023):

  • Top 1% own approximately 43% of global wealth
  • Top 10% own approximately 76% of global wealth
  • Bottom 50% own approximately 2% of global wealth

Lorenz curve (illustrative):
     ╭──────────── Perfect equality
    ╱
   ╱  ╭─────────── Actual distribution
  ╱  ╱
 ╱  ╱   ← Gini area = A/(A+B)
╱──╱
└────────
  population (poorest → richest)

Gini coefficient ≈ 0.85 for global wealth. How concentrated is this? You are about to find out.

Gini coefficient and Lorenz curve — annotated

Lorenz curve: plot the cumulative population proportion (\(x\)-axis) against cumulative wealth share (\(y\)-axis), sorted from poorest to richest.

Gini coefficient: \[ G = 1 - 2\int_0^1 L(x)\, dx \;=\; \frac{\text{Area between curve and equality line}}{\text{Total area under equality line}} \]

Herfindahl-Hirschman Index (HHI): concentration in a market with \(n\) firms: \[ \text{HHI} = \sum_{i=1}^n s_i^2 \quad \text{where } s_i = \text{market share of firm } i \]

  • HHI → 0: many small firms (competitive market)
  • HHI → 10,000: single-firm monopoly (shares in percent, so the maximum is \(100^2 = 10{,}000\))
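Both concentration measures are a few lines of code. A sketch (area under the Lorenz curve by the trapezoidal rule; the data are illustrative):

```python
import numpy as np

def gini(values):
    """Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    lorenz = np.concatenate(([0.0], np.cumsum(v) / v.sum()))  # L(0)=0 .. L(1)=1
    area = np.sum(lorenz[1:] + lorenz[:-1]) / (2 * n)         # trapezoids, width 1/n
    return 1.0 - 2.0 * area

def hhi(shares_pct):
    """HHI = sum of squared market shares, shares expressed in percent."""
    return float(sum(s**2 for s in shares_pct))

print(gini([100] * 20))   # ~0: twenty equal households, perfect equality
print(hhi([100.0]))       # 10000.0: monopoly
print(hhi([10.0] * 10))   # 1000.0: ten equal firms
```

Note the discrete-data ceiling: with \(n\) observations and one household owning everything, this Gini tops out at \(1 - 1/n\), not exactly 1.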

Concentration app

Task — stylised wealth distribution pre-loaded (20 households, values in $k)

  1. Read the Gini coefficient. Is this distribution closer to equality or maximum concentration?
  2. Inspect the Lorenz curve. Approximately what share do the bottom 50% hold? The top 10%?
  3. Replace the data with 20 equal values (e.g., type 100 repeated 20 times). What happens to the Gini and the Lorenz curve?
  4. Restore the original data. Which single value drives most of the concentration? What does removing it do to the Gini?

KDE: the smooth histogram

Kernel Density Estimate (KDE): places a small kernel (usually Gaussian) at each data point and sums them.

\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) \]

  • \(h\) = bandwidth (controls smoothness): too small → spiky; too large → oversmoothed
  • Optimal bandwidth (Silverman’s rule): \(h = 1.06\, s\, n^{-1/5}\)
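The estimator and the bandwidth rule above can be sketched in a few lines (Gaussian kernel; simulated data for illustration):

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 1.06 * s * n^(-1/5)."""
    return 1.06 * np.std(data, ddof=1) * len(data) ** (-0.2)

def kde(grid, data, h):
    """Gaussian KDE: one kernel per data point, summed and normalised."""
    z = (grid[:, None] - data[None, :]) / h                 # shape (grid, n)
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=300)
grid = np.linspace(-6, 10, 400)
fhat = kde(grid, data, silverman_bandwidth(data))
# A density estimate should integrate to ~1 over a wide enough grid
print(np.sum(fhat) * (grid[1] - grid[0]))
```

Halving or doubling `h` in the last call reproduces the spiky/oversmoothed trade-off described above.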

Advantages over histogram: smooth; no bin boundary artefacts; works for overlaying multiple groups.

Extensions: bivariate KDE (density plot of two variables); conditional KDE. See handbook chapters bidensity and conditionaleda.

The residual diagnostic workflow

After fitting any regression or time series model, run these five checks in order:

Step  Plot                           Assumption checked
1     Histogram of residuals         Residuals approximately Normal
2     QQ plot of residuals           Normality — a sharper visual check
3     Residuals vs. fitted values    Linearity & constant variance (homoskedasticity)
4     Residuals vs. each predictor   No missed non-linear relationships
5     ACF of residuals               Independence — no serial correlation

If any check fails: the associated inference (t-tests, F-test, CIs) is unreliable. Do not report p-values from a model with violated assumptions.

Residual check: histogram → QQ → scatter → ACF

What to look for in each plot:

Histogram of residuals:
  • ✓ Roughly bell-shaped, centred at 0
  • ✗ Heavy tails or strong skewness → transform \(Y\) or use robust regression

QQ plot of residuals:
  • ✓ Points on the diagonal
  • ✗ S-curve (both tails deviate) → leptokurtic (heavy-tailed) residuals

Residuals vs. fitted:
  • ✓ Random cloud, no pattern, constant spread
  • ✗ U-shape → missed non-linearity (add a polynomial term)
  • ✗ Fan shape (spread increases) → heteroskedasticity (try log \(Y\))

ACF of residuals:
  • ✓ All lags within the Bartlett 95% bands (\(\pm 2/\sqrt{n}\))
  • ✗ Spike at lag 1, then decay → AR(1) autocorrelation → standard errors are wrong
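The ACF check can be run numerically as well as visually. A sketch with simulated AR(1) "residuals" — i.e., what a failed independence check looks like — using the \(\pm 2/\sqrt{n}\) Bartlett bands from above:

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations rho_0 .. rho_nlags."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.sum(x**2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

n = 500
rng = np.random.default_rng(7)
# AR(1) residuals: e_t = 0.6 * e_{t-1} + noise
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal()

rho = acf(e, nlags=10)
band = 2 / np.sqrt(n)
flagged = [k for k in range(1, 11) if abs(rho[k]) > band]
print(flagged)  # lag 1 (and several after it) exceed the band -> serial correlation
```

For genuinely independent residuals, roughly 1 lag in 20 would be expected to poke outside the band by chance; a decaying run of spikes starting at lag 1 is the AR(1) signature.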

Time series EDA: what to look for before modeling

Before fitting any time series model, describe the raw series:

  1. Trend: is the level rising, falling, or flat over time?
  2. Seasonality: regular periodic cycles (annual, quarterly, weekly)?
  3. Irregular / residual: unexplained variation after removing trend and seasonality?

ACF (Autocorrelation Function): \(\rho_k = \text{Corr}(Y_t, Y_{t-k})\) — how correlated is a series with its own lag-\(k\) past.

PACF (Partial ACF): \(\phi_{kk}\) — correlation at lag \(k\) after controlling for lags 1 through \(k-1\).

Model identification rules (brief):

  • AR(\(p\)): ACF decays gradually, PACF cuts off after lag \(p\)
  • MA(\(q\)): ACF cuts off after lag \(q\), PACF decays gradually
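The PACF can be computed from the ACF by the Durbin–Levinson recursion. A sketch that reproduces the AR identification rule: for an AR(1) with coefficient \(\phi = 0.6\), the theoretical ACF is \(\rho_k = 0.6^k\) (decaying), and the PACF should cut off after lag 1:

```python
import numpy as np

def pacf_from_acf(rho):
    """Partial autocorrelations phi_kk via the Durbin-Levinson recursion."""
    K = len(rho) - 1
    phi = np.zeros((K + 1, K + 1))
    pac = np.zeros(K + 1)
    pac[0] = 1.0
    for k in range(1, K + 1):
        num = rho[k] - sum(phi[k - 1, j] * rho[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1, j] * rho[j] for j in range(1, k))
        phi[k, k] = num / den
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        pac[k] = phi[k, k]
    return pac

# Theoretical AR(1) ACF with phi = 0.6
rho = 0.6 ** np.arange(5)
print(pacf_from_acf(rho))  # approx [1.0, 0.6, 0.0, 0.0, 0.0]: cuts off after lag 1
```

On real data you would feed in sample autocorrelations instead of the theoretical ones, so the "zero" lags are only approximately zero and are judged against the \(\pm 2/\sqrt{n}\) bands.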

Time series + ACF/PACF app

Task — load airline passenger data

  1. Plot the raw time series. Describe trend and seasonality in words.
  2. Inspect the ACF plot. Which lags are significant? What do spikes at lags 12, 24, 36 indicate?
  3. Inspect the PACF. After which lag does it cut off? What AR order does this suggest?
  4. Bridge to residuals: “If my model’s residuals had this ACF pattern, what would that mean?”
  5. (Debrief) What transformation would you apply first before modeling? Why?

Selection guide walkthrough (live demo)

The handbook’s interactive selection guide maps your data situation to the appropriate method.

Live demo — use the constraint picker to answer:

  • “I have two continuous variables and want to measure their association.” → Pearson (if linear, normal) or Spearman (if non-linear or skewed)
  • “I want to describe inequality in a distribution.” → Gini coefficient + Lorenz curve
  • “I want to check whether my regression residuals are autocorrelated.” → ACF plot of residuals

Rule: If you are unsure which method to use — consult the selection guide before choosing.

Key ideas — selection table (L2)

Question                                   Tool                                   What you read
Two continuous variables — association?    Scatterplot first, then Pearson \(r\)  Direction, form, strength, outliers; \(r^2\) = variance explained
Non-linear or ranked data — association?   Spearman \(\rho\) or Kendall \(\tau\)  Monotone association; \(\tau\) has a probabilistic interpretation
Inequality in a distribution?              Gini coefficient + Lorenz curve        0 = equality, 1 = maximum concentration; area under Lorenz
Market concentration?                      Herfindahl-Hirschman Index (HHI)       Closer to 10,000 → more concentrated
Smooth distributional shape?               KDE (bandwidth selection)              Peaks, modes, tails
Are residuals Normal?                      QQ plot + histogram of residuals       Points on the diagonal; bell-shaped histogram
Are residuals independent?                 ACF of residuals                       No significant spikes beyond Bartlett bands
Is there serial structure in raw data?     ACF + PACF                             Trend, seasonality, AR/MA order

Exit problem (pairs, 5 min)

A logistics analyst models delivery time (hours) as a function of distance and package weight using linear regression. She inspects the diagnostic plots:

  • Plot A: residuals vs. fitted values — shows a clear U-shape
  • Plot B: QQ plot of residuals — points deviate from the diagonal in both tails
  • Plot C: ACF of residuals — spike at lag 1 (\(\rho_1 \approx 0.6\)) and decreasing spikes thereafter
  1. Which assumption does Plot A violate? What should she do?
  2. Which assumption does Plot B suggest may be violated?
  3. What does Plot C tell her, and why does it matter for inference?
  4. She also finds Gini = 0.72 for delivery times across routes. What does this say?

Use the selection table on the previous slide.