Relationships, Association, and Model Validation
Roadmap: Scatterplot (visual pattern) → Pearson r (linear association) → Spearman (rank association) → Concentration → Residual reuse and time structure (residual/time checks)
The same EDA tools validate model residuals.
One question drives the lecture: “How do variables relate — and are your model assumptions met?”
Hands up: true or false?
“Pearson correlation \(r = 0.9\) means 90% of the variance in \(Y\) is explained by \(X\).” → False: the share of variance explained is \(r^2 = 0.81\), i.e. 81%.
“Spearman correlation should be used when the relationship between two variables is monotone but not necessarily linear.” → True.
“If residuals from a regression model show a pattern in the ACF plot, this is not a problem as long as the model fits well overall.” → False: serially correlated residuals make the standard errors, and hence the inference, unreliable.
Scatterplot: each observation is a point \((x_i, y_i)\). The most information-rich bivariate display.
What to look for:
- Direction: positive or negative association?
- Form: linear or non-linear (curved, step-shaped)?
- Strength: tight cluster around a line, or dispersed cloud?
- Outliers: points far from the general pattern?
The scatterplot comes first. Do not compute a correlation coefficient before looking at the plot — a single number can miss all four questions above.
Monthly coffee commodity prices: USA market and Colombia market, 1990–2020.
USA price (USD/lb)
4 ┤ ╭──
3 ┤ ╭─────╯
2 ┤ ╭──────────────────╯
1 ┤────────────╯
└──────────────────────────────────────
1990 2000 2010 2020
Colombia price — similar shape but different level and currency.
Hypothesis: two markets trading the same commodity should track together — but how tightly? What breaks the correlation?
Pearson \(r \approx 0.7\) — moderate positive. But is it linear? Are there periods of divergence?
Pearson correlation coefficient:
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x\, s_y} = \frac{\text{Cov}(X,Y)}{s_x\, s_y} \]
t-test for \(r\): \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0: \rho = 0 \]
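The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the function names `pearson_r` and `t_statistic` and the toy data are assumptions made for the example (the coffee price data is not reproduced here). The factor \(n-1\) cancels between numerator and denominator, so the code works with raw sums of squares.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation; the (n - 1) factors cancel out."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, to be compared with t on n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical toy data, not the coffee prices from the lecture.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, t = {t:.3f} on {len(x) - 2} df")
```

For this toy sample \(r \approx 0.775\) and \(r^2 = 0.6\), which also illustrates that the share of variance explained is \(r^2\), not \(r\).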
Task — coffee prices: Colombia vs. USA, pre-loaded
Pearson r is valid when:
1. Both variables are continuous (interval or ratio scale)
2. The relationship is approximately linear
3. No extreme outliers (one outlier can shift \(r\) by ±0.2 in small samples)
4. (For inference only) Both variables approximately normally distributed

When Pearson fails, use an alternative:
- Non-linear monotone relationship → Spearman \(\rho\)
- Heavy tails / outliers → Spearman \(\rho\) or Kendall’s \(\tau\)
- Controlling for a third variable → Partial correlation
- Categorical variable → Cramér’s V (nominal) or rank biserial
Spearman \(\rho\): Replace each value with its rank, then compute Pearson on the ranks.
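\[ \rho_S = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \quad \text{where } d_i = \text{rank}(x_i) - \text{rank}(y_i) \]
This shortcut is exact only when there are no tied ranks; with ties, compute Pearson \(r\) on the (average) ranks directly.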
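The "rank, then Pearson" recipe can be checked against the \(d_i^2\) shortcut with a short sketch. The helper names (`average_ranks`, `spearman_rho`, `spearman_shortcut`) and both toy data sets are assumptions for illustration only. Tied values receive the mean of their rank positions, which is one common convention.

```python
import math

def average_ranks(values):
    """Ranks 1..n; tied values get the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def spearman_shortcut(x, y):
    """The d_i^2 formula; only exact when there are no ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example 1: monotone but non-linear (y = x cubed).
x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]
r_cubic = pearson_r(x, y_cubic)
rho_cubic = spearman_rho(x, y_cubic)

# Hypothetical example 2: two swapped neighbours, no ties.
y_swapped = [1, 3, 2, 5, 4]
rho_ranks = spearman_rho(x, y_swapped)
rho_short = spearman_shortcut(x, y_swapped)

print(f"cubic: Pearson {r_cubic:.3f}, Spearman {rho_cubic:.3f}")
print(f"swapped: ranks {rho_ranks:.3f}, shortcut {rho_short:.3f}")
```

In the cubic example Spearman reaches 1 while Pearson stays below 1, because the relationship is perfectly monotone but not linear. In the second example both routes agree, as expected without ties.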
Kendall’s \(\tau\): Counts concordant pairs (\(C\)) minus discordant pairs (\(D\)):
\[ \tau = \frac{C - D}{\binom{n}{2}} \]
| Property | Spearman | Kendall |
|---|---|---|
| Sensitive to all monotone relations | ✓ | ✓ |
| More robust to outliers in ranks | moderate | high |
| Interpretable as probability | ✗ | ✓ |
\(\tau\) has a direct probabilistic interpretation: probability that a randomly chosen pair is concordant minus the probability it is discordant.
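The pair-counting definition translates directly into code. The sketch below is a brute-force \(O(n^2)\) version of the formula above (often called tau-a), with a hypothetical function name and toy data; pairs tied in either variable count as neither concordant nor discordant.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (C - D) divided by the number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x2 - x1) * (y2 - y1)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    n_pairs = n * (n - 1) / 2
    return (concordant - discordant) / n_pairs, concordant, discordant

# Hypothetical toy data: two neighbouring pairs are swapped.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

tau, c, d = kendall_tau(x, y)
print(f"C = {c}, D = {d}, tau = {tau:.2f}")
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so \(\tau = 0.8 - 0.2 = 0.6\), which is the probability difference described above.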
Tyler Vigen’s database (tylervigen.com): Nicolas Cage film releases vs. swimming pool drownings — \(r \approx 0.67\), \(p < 0.05\).
Year: 2000 2002 2004 2006 2008 2010
Nicolas Cage films: 2 3 4 2 3 2
Pool drownings: 109 102 117 104 113 103
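Why does this happen? Two series can trend together, or share seasonal structure driven by a common third factor (e.g., summer activity levels, year-on-year economic growth), without any direct link between them. When many pairs of series are screened, some will also correlate strongly by chance alone.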
Lesson: correlation is not causation. Reading a statistically significant \(r\) as a causal effect additionally requires:
- A plausible causal mechanism
- Ruling out confounders
- Temporal precedence (cause before effect)
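The shared-trend explanation can be illustrated with simulated data. The sketch below is an assumption-laden toy, not Tyler Vigen's data: two made-up series each follow their own linear trend plus independent Gaussian noise (the slopes, noise level and seed are arbitrary). Correlating the levels picks up the common trend; correlating the first differences removes it.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # arbitrary seed, only for reproducibility
n = 200

# Two hypothetical series: a trend plus noise, with no link between the noises.
a = [1.0 * t + random.gauss(0, 5) for t in range(n)]
b = [0.5 * t + random.gauss(0, 5) for t in range(n)]

# Correlation of the raw levels is dominated by the shared upward trend.
r_levels = pearson_r(a, b)

# First differences remove the linear trend and leave only the noise.
da = [a[t] - a[t - 1] for t in range(1, n)]
db = [b[t] - b[t - 1] for t in range(1, n)]
r_diffs = pearson_r(da, db)

print(f"levels: r = {r_levels:.2f}, first differences: r = {r_diffs:.2f}")
```

The levels correlate strongly even though the two series are unrelated by construction, while the differenced series show a correlation near zero.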
Global wealth distribution (Credit Suisse Wealth Report, 2023):
Lorenz curve (illustrative):
╭──────────── Perfect equality
╱
╱ ╭─────────── Actual distribution
╱ ╱
╱ ╱ ← Gini area = A/(A+B)
╱──╱
└────────
population (poorest → richest)
Gini coefficient ≈ 0.85 for global wealth. How concentrated is this? You are about to find out.
Lorenz curve: plot the cumulative population proportion (\(x\)-axis) against cumulative wealth share (\(y\)-axis), sorted from poorest to richest.
Gini coefficient: \[ G = 1 - 2\int_0^1 L(x)\, dx \;=\; \frac{\text{Area between curve and equality line}}{\text{Total area under equality line}} \]
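A minimal sketch of both definitions, assuming non-negative values with a positive total: the Lorenz curve is built as cumulative shares, and the integral in \(G\) is approximated by the trapezoid rule over those points. The function names and the example distributions are hypothetical.

```python
def lorenz_points(values):
    """Cumulative (population share, wealth share) points, poorest first."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini(values):
    """G = 1 - 2 * (area under the Lorenz curve), area by trapezoids."""
    pts = lorenz_points(values)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

# Hypothetical distributions, values in arbitrary units.
equal = [100] * 20           # everyone holds the same amount
one_holder = [0, 0, 0, 100]  # a single household holds everything

g_equal = gini(equal)
g_one = gini(one_holder)
print(f"equal: G = {g_equal:.3f}, one holder of four: G = {g_one:.3f}")
```

With perfect equality the Lorenz curve coincides with the diagonal and \(G = 0\). With a finite number of households the maximum is \((n-1)/n\) rather than exactly 1, which is why the four-household example gives 0.75.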
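Herfindahl-Hirschman Index (HHI): concentration in a market with \(n\) firms: \[ \text{HHI} = \sum_{i=1}^n s_i^2 \quad \text{where } s_i = \text{market share of firm } i \] With shares expressed as fractions, HHI ranges from \(1/n\) (equal shares) to 1 (monopoly); with shares in percentage points, the same index runs up to \(100^2 = 10{,}000\).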
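The HHI is a one-line sum, sketched here with shares given as fractions. The function name `hhi` and the four-firm market are made up for illustration; multiplying the result by 10,000 gives the percentage-point scale.

```python
def hhi(shares):
    """Sum of squared market shares; shares are fractions that sum to 1."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(s * s for s in shares)

# Hypothetical markets.
market = [0.4, 0.3, 0.2, 0.1]
equal_split = [0.25] * 4
monopoly = [1.0]

h_market = hhi(market)
h_equal = hhi(equal_split)
h_monopoly = hhi(monopoly)

for name, h in [("market", h_market), ("equal", h_equal), ("monopoly", h_monopoly)]:
    print(f"{name}: HHI = {h:.2f} (fractions) = {h * 10_000:.0f} (percentage points)")
```

The equal split reaches the lower bound \(1/n = 0.25\), and the monopoly reaches the upper bound of 1, or 10,000 on the percentage-point scale.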
Task — stylised wealth distribution pre-loaded (20 households, values in $k)
100 repeated 20 times). What happens to the Gini and the Lorenz curve?
Kernel Density Estimate (KDE): places a small kernel \(K\) (usually Gaussian) of bandwidth \(h\) on each data point and averages them.
\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) \]
Advantages over histogram: smooth; no bin boundary artefacts; works for overlaying multiple groups.
Extensions: bivariate KDE (density plot of two variables); conditional KDE. See handbook chapters bidensity and conditionaleda.
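The univariate KDE formula can be evaluated directly. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth; the data, the bandwidth value and the function names are illustrative only, and no bandwidth selection rule is implemented.

```python
import math

def gaussian_kernel(u):
    """Standard Normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """f_hat(x) = 1 / (n h) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical sample with two clusters, and an arbitrary bandwidth.
data = [1.0, 1.5, 2.0, 4.0, 4.5]
h = 0.5

# Evaluate on a grid and check numerically that the estimate integrates to 1.
step = 0.01
grid = [-5 + step * i for i in range(1501)]  # from -5 to 10
density = [kde(x, data, h) for x in grid]
total_mass = sum(density) * step

at_cluster = kde(1.5, data, h)
at_gap = kde(3.0, data, h)
print(f"mass = {total_mass:.4f}, f(1.5) = {at_cluster:.3f}, f(3.0) = {at_gap:.3f}")
```

The estimate is high inside a cluster and low in the gap between clusters. Increasing `h` smooths the two bumps towards a single one, which is why bandwidth choice matters.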
After fitting any regression or time series model, run these five checks in order:
| Step | Plot | Assumption checked |
|---|---|---|
| 1 | Histogram of residuals | Residuals approximately Normal |
| 2 | QQ plot of residuals | Normality (visual check, most sensitive in the tails) |
| 3 | Residuals vs. fitted values | Linearity & constant variance (homoskedasticity) |
| 4 | Residuals vs. each predictor | No missed non-linear relationships |
| 5 | ACF of residuals | Independence — no serial correlation |
If any check fails: the associated inference (t-tests, F-test, CIs) is unreliable. Do not report p-values from a model with violated assumptions.
What to look for in each plot:
Histogram of residuals:
- ✓ Roughly bell-shaped, centred at 0
- ✗ Heavy tails, strong skewness → transform Y or use robust regression

QQ plot of residuals:
- ✓ Points on the diagonal
- ✗ S-curve (both tails deviate from the line) → non-Normal tails: heavier than Normal (leptokurtic) when the tail points are more extreme than the line predicts, lighter than Normal when they are less extreme

Residuals vs. fitted:
- ✓ Random cloud, no pattern, constant spread
- ✗ U-shape → missed non-linearity (add polynomial term)
- ✗ Fan shape (spread increases) → heteroskedasticity (try log Y)

ACF of residuals:
- ✓ All lags within the approximate Bartlett 95% CI bands (\(\pm 2/\sqrt{n}\))
- ✗ Spike at lag 1 and decaying → AR(1) autocorrelation → wrong SEs
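The ACF check on residuals can be sketched numerically without any plotting. Everything in the example is an assumption for illustration: a simulated straight-line relationship whose errors follow an AR(1) process with coefficient 0.8, a one-predictor least-squares fit written out by hand, and the \(\pm 2/\sqrt{n}\) bands from the checklist above.

```python
import math
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    return [sum((series[t] - m) * (series[t - k] - m) for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

random.seed(1)  # arbitrary seed, only for reproducibility
n = 200
x = [float(t) for t in range(n)]

# Hypothetical AR(1) errors: each error carries over 80% of the previous one.
errors, e = [], 0.0
for _ in range(n):
    e = 0.8 * e + random.gauss(0, 1)
    errors.append(e)
y = [2.0 + 0.5 * xi + ei for xi, ei in zip(x, errors)]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

band = 2 / math.sqrt(n)
r = acf(residuals, 5)
flagged = [k for k, rk in enumerate(r, start=1) if abs(rk) > band]
print(f"slope = {slope:.3f}, band = ±{band:.3f}, lags outside band: {flagged}")
```

The fitted slope is close to the true value, so the model "fits well overall", yet the residual ACF at lag 1 lies far outside the bands. That is the situation in which the reported standard errors cannot be trusted.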
Before fitting any time series model, describe the raw series:
ACF (Autocorrelation Function): \(\rho_k = \text{Corr}(Y_t, Y_{t-k})\) — how correlated is a series with its own lag-\(k\) past.
PACF (Partial ACF): \(\phi_{kk}\) — correlation at lag \(k\) after controlling for lags 1 through \(k-1\).
Model identification rules (brief):
- AR(\(p\)): ACF decays, PACF cuts off after lag \(p\)
- MA(\(q\)): ACF cuts off after lag \(q\), PACF decays
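The two identification rules can be verified on theoretical autocorrelations. The sketch below uses the Durbin-Levinson recursion, one standard way to obtain \(\phi_{kk}\) from \(\rho_1, \dots, \rho_k\); it is not necessarily the method used in the handbook. The function name and the parameter values (AR coefficient 0.6, MA coefficient 0.5) are assumptions for the example.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion; rho[k - 1] is the autocorrelation at lag k.

    Returns the partial autocorrelations phi_kk for k = 1..len(rho).
    """
    pacf = []
    prev = []  # coefficients phi_{k-1, 1}, ..., phi_{k-1, k-1}
    for k in range(1, len(rho) + 1):
        if k == 1:
            phi_kk = rho[0]
        else:
            num = rho[k - 1] - sum(prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(prev[j] * rho[j] for j in range(k - 1))
            phi_kk = num / den
        # Update phi_{k, j} = phi_{k-1, j} - phi_kk * phi_{k-1, k-j}.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# AR(1) with coefficient 0.6: theoretical ACF is 0.6 ** k (decays).
ar1_acf = [0.6 ** k for k in range(1, 5)]
ar1_pacf = pacf_from_acf(ar1_acf)

# MA(1) with coefficient 0.5: ACF is theta / (1 + theta^2) at lag 1, then 0.
theta = 0.5
ma1_acf = [theta / (1 + theta ** 2), 0.0, 0.0, 0.0]
ma1_pacf = pacf_from_acf(ma1_acf)

print("AR(1) PACF:", [round(v, 3) for v in ar1_pacf])
print("MA(1) PACF:", [round(v, 3) for v in ma1_pacf])
```

For the AR(1) case the PACF is 0.6 at lag 1 and zero afterwards (cut-off), while for the MA(1) case the ACF cuts off after lag 1 and the PACF decays with alternating sign.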
Task — load airline passenger data
The handbook’s interactive selection guide maps your data situation to the appropriate method.
Live demo: Use the constraint picker to answer:
- “I have two continuous variables and want to measure their association.” → Pearson (if linear, normal) or Spearman (if monotone but non-linear, or skewed)
- “I want to describe inequality in a distribution.” → Gini coefficient + Lorenz curve
- “I want to check whether my regression residuals are autocorrelated.” → ACF plot of residuals
Rule: If you are unsure which method to use — consult the selection guide before choosing.
| Question | Tool | What you read |
|---|---|---|
| Two continuous variables — association? | Scatterplot first, then Pearson \(r\) | Direction, form, strength, outliers; \(r^2\) = variance explained |
| Non-linear or ranked data — association? | Spearman \(\rho\) or Kendall \(\tau\) | Monotone association; \(\tau\) has probabilistic interpretation |
| Inequality in a distribution? | Gini coefficient + Lorenz curve | 0 = equality, 1 = maximum concentration; Gini = twice the area between the Lorenz curve and the equality line |
| Market concentration? | Herfindahl-Hirschman Index | Closer to 10,000 (shares in %) or to 1 (shares as fractions) → more concentrated |
| Smooth distributional shape? | KDE (bandwidth selection) | Peaks, modes, tails |
| Are residuals Normal? | QQ plot + histogram of residuals | Points on diagonal; bell-shaped histogram |
| Are residuals independent? | ACF of residuals | No significant spikes beyond Bartlett bands |
| Is there serial structure in raw data? | ACF + PACF | Trend, seasonality, AR/MA order |
A logistics analyst models delivery time (hours) as a function of distance and package weight using linear regression. She inspects the diagnostic plots:
Use the selection table on the previous slide.
Descriptive Statistics & EDA — Lecture 2