Descriptive Statistics & EDA — Lecture 2

Relationships, Association, and Model Validation

Session roadmap

Scatterplot ──→ Pearson r ──→ Spearman ──→ Concentration ──→ Residual reuse ──→ Time structure
 visual pattern   linear association   rank association   concentration   residual/time checks
                    ╲                                              ╱
                     ╲── same EDA tools validate model residuals ─╱

One question drives the lecture: “How do variables relate — and are your model assumptions met?”

Pre-reading check

Hands up: true or false?

  1. “Pearson correlation \(r = 0.9\) means 90% of the variance in \(Y\) is explained by \(X\).”    false

  2. “Spearman correlation should be used when the relationship between two variables is monotone but not necessarily linear.”    true

  3. “If residuals from a regression model show a pattern in the ACF plot, this is not a problem as long as the model fits well overall.”    false

From one variable to two: scatterplot as the starting point

Scatterplot: each observation is a point \((x_i, y_i)\). The most information-rich bivariate display.

What to look for:

  • Direction: positive or negative association?
  • Form: linear or non-linear (curved, step-shaped)?
  • Strength: tight cluster around a line, or a dispersed cloud?
  • Outliers: points far from the general pattern?

The scatterplot comes first. Do not compute a correlation coefficient before looking at the plot — a single number can miss all four questions above.

Story: coffee prices USA vs. Colombia

Monthly coffee commodity prices: USA market and Colombia market, 1990–2020.

USA price (USD/lb)
   4 ┤                                      ╭──
   3 ┤                               ╭─────╯
   2 ┤            ╭──────────────────╯
   1 ┤────────────╯
     └──────────────────────────────────────
         1990        2000        2010        2020

Colombia price — similar shape but different level and currency.

Hypothesis: two markets trading the same commodity should track together — but how tightly? What breaks the correlation?

Pearson \(r \approx 0.7\) — moderate positive. But is it linear? Are there periods of divergence?

Scatterplot + correlation formula — annotated

Pearson correlation coefficient:

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x\, s_y} = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y} \]

  • Range: \(-1 \leq r \leq 1\)
  • \(r = 0\) means no linear association (not “no association”)
  • \(r^2\) = coefficient of determination = proportion of variance in \(Y\) explained by \(X\)

t-test for \(r\): \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0: \rho = 0 \]
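Both formulas are easy to verify numerically. A minimal sketch (NumPy only; the data here are simulated purely for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r via the covariance formula: Cov(X, Y) / (s_x * s_y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2 under H0: rho = 0."""
    return r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Simulated example: y depends linearly on x plus noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(size=200)
r = pearson_r(x, y)
print(r, r**2, t_statistic(r, len(x)))
```

The same values come out of `np.corrcoef` or `scipy.stats.pearsonr`; writing it out once makes clear which pieces (covariance, the two standard deviations) drive the number.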

Scatterplot + Pearson + Spearman app

Task — coffee prices: Colombia vs. USA, pre-loaded

  1. Inspect the scatterplot. Does the relationship look linear? Are there periods of divergence?
  2. Record Pearson \(r\) and \(R^2\). What percentage of USA price variance is explained by Colombia price?
  3. Compare Pearson \(r\) to Spearman \(\rho\). Which is larger? What does the difference tell you about linearity?
  4. Is the spread of points constant across the price range, or does it fan out? What does this imply for a regression model?

Pearson: assumptions and when it fails

Pearson \(r\) is valid when:

  1. Both variables are continuous (interval or ratio scale)
  2. The relationship is approximately linear
  3. There are no extreme outliers (one outlier can shift \(r\) by ±0.2 in small samples)
  4. (For inference only) Both variables are approximately normally distributed

When Pearson fails — use alternatives:

  • Non-linear monotone relationship → Spearman \(\rho\)
  • Heavy tails / outliers → Spearman \(\rho\) or Kendall’s \(\tau\)
  • Controlling for a third variable → partial correlation
  • Categorical variable → Cramér’s V (nominal) or rank-biserial correlation

Rank correlation: Spearman vs. Kendall’s tau

Spearman \(\rho\): Replace each value with its rank, then compute Pearson on the ranks.

\[ \rho_S = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \quad \text{where } d_i = \text{rank}(x_i) - \text{rank}(y_i) \]

Kendall’s \(\tau\): Counts concordant pairs (\(C\)) minus discordant pairs (\(D\)):

\[ \tau = \frac{C - D}{\binom{n}{2}} \]

Property                               Spearman   Kendall
Sensitive to all monotone relations    yes        yes
Robustness to outliers in ranks        moderate   high
Interpretable as a probability         no         yes

\(\tau\) has a direct probabilistic interpretation: probability that a randomly chosen pair is concordant minus the probability it is discordant.
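Both rank coefficients follow directly from their definitions. A sketch (the rank-difference formula for \(\rho_S\) assumes no tied values, and the pairwise loop for \(\tau\) is the naive \(O(n^2)\) version):

```python
import numpy as np

def ranks(v):
    """Ranks 1..n (assumes no ties)."""
    return np.argsort(np.argsort(v)) + 1

def spearman_rho(x, y):
    """Spearman rho via the rank-difference formula (no ties)."""
    d = ranks(x) - ranks(y)
    n = len(x)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

def kendall_tau(x, y):
    """Kendall tau: (concordant pairs - discordant pairs) / C(n, 2)."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

# A perfectly monotone but non-linear relation: both coefficients equal 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**3
print(spearman_rho(x, y), kendall_tau(x, y))  # 1.0 1.0
```

Pearson \(r\) on the same data would be below 1, because \(y = x^3\) is monotone but not linear.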

Spurious correlations — the trap

Tyler Vigen’s database (tylervigen.com): Nicolas Cage film releases vs. swimming pool drownings — \(r \approx 0.67\), \(p < 0.05\).

Year:                2000  2002  2004  2006  2008  2010
Nicolas Cage films:     2     3     4     2     3     2
Pool drownings:       109   102   117   104   113   103

Why does this happen? Both time series trend together or share seasonal structure driven by a common third factor (e.g., summer activity levels, year-on-year economic growth).

Lesson: correlation is not causation. Before reading a statistically significant \(r\) causally, you need:

  • A plausible causal mechanism
  • Ruling out confounders
  • Temporal precedence (cause before effect)

Story: Concentration — who owns how much?

Global wealth distribution (Credit Suisse Wealth Report, 2023):

  • Top 1% own approximately 43% of global wealth
  • Top 10% own approximately 76% of global wealth
  • Bottom 50% own approximately 2% of global wealth

Lorenz curve (illustrative):
     ╭──────────── Perfect equality
    ╱
   ╱  ╭─────────── Actual distribution
  ╱  ╱
 ╱  ╱   ← Gini area = A/(A+B)
╱──╱
└────────
  population (poorest → richest)

Gini coefficient ≈ 0.85 for global wealth. How concentrated is this? You are about to find out.

Gini coefficient and Lorenz curve — annotated

Lorenz curve: plot the cumulative population proportion (\(x\)-axis) against cumulative wealth share (\(y\)-axis), sorted from poorest to richest.

Gini coefficient: \[ G = 1 - 2\int_0^1 L(x)\, dx \;=\; \frac{\text{Area between curve and equality line}}{\text{Total area under equality line}} \]

Herfindahl-Hirschman Index (HHI): concentration in a market with \(n\) firms: \[ \text{HHI} = \sum_{i=1}^n s_i^2 \quad \text{where } s_i = \text{market share of firm } i \]

  • HHI → 0: many small firms (competitive market)
  • HHI → 10,000: single-firm monopoly (shares in percent, so the maximum is \(100^2 = 10{,}000\))
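Both concentration measures are a few lines of code. A sketch (area under the Lorenz curve by the trapezoidal rule; the data are illustrative):

```python
import numpy as np

def gini(values):
    """Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    lorenz = np.concatenate(([0.0], np.cumsum(v) / v.sum()))  # L(0)=0 .. L(1)=1
    area = np.sum(lorenz[1:] + lorenz[:-1]) / (2 * n)         # trapezoids, width 1/n
    return 1.0 - 2.0 * area

def hhi(shares_pct):
    """HHI = sum of squared market shares, shares expressed in percent."""
    return float(sum(s**2 for s in shares_pct))

print(gini([100] * 20))   # ~0: twenty equal households, perfect equality
print(hhi([100.0]))       # 10000.0: monopoly
print(hhi([10.0] * 10))   # 1000.0: ten equal firms
```

Note the discrete-data ceiling: with \(n\) observations and one household owning everything, this Gini tops out at \(1 - 1/n\), not exactly 1.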

Concentration app

Task — stylised wealth distribution pre-loaded (20 households, values in $k)

  1. Read the Gini coefficient. Is this distribution closer to equality or maximum concentration?
  2. Inspect the Lorenz curve. Approximately what share do the bottom 50% hold? The top 10%?
  3. Replace the data with 20 equal values (e.g., type 100 repeated 20 times). What happens to the Gini and the Lorenz curve?
  4. Restore the original data. Which single value drives most of the concentration? What does removing it do to the Gini?

KDE: the smooth histogram

Kernel Density Estimate (KDE): places a small kernel (usually Gaussian) at each data point and sums them.

\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) \]

  • \(h\) = bandwidth (controls smoothness): too small → spiky; too large → oversmoothed
  • Optimal bandwidth (Silverman’s rule): \(h = 1.06\, s\, n^{-1/5}\)
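The estimator and the bandwidth rule above can be sketched in a few lines (Gaussian kernel; simulated data for illustration):

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 1.06 * s * n^(-1/5)."""
    return 1.06 * np.std(data, ddof=1) * len(data) ** (-0.2)

def kde(grid, data, h):
    """Gaussian KDE: one kernel per data point, summed and normalised."""
    z = (grid[:, None] - data[None, :]) / h                 # shape (grid, n)
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=300)
grid = np.linspace(-6, 10, 400)
fhat = kde(grid, data, silverman_bandwidth(data))
# A density estimate should integrate to ~1 over a wide enough grid
print(np.sum(fhat) * (grid[1] - grid[0]))
```

Halving or doubling `h` in the last call reproduces the spiky/oversmoothed trade-off described above.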

Advantages over histogram: smooth; no bin boundary artefacts; works for overlaying multiple groups.

Extensions: bivariate KDE (density plot of two variables); conditional KDE. See handbook chapters bidensity and conditionaleda.

The residual diagnostic workflow

After fitting any regression or time series model, run these five checks in order:

Step  Plot                           Assumption checked
1     Histogram of residuals         Residuals approximately Normal
2     QQ plot of residuals           Normality — a sharper visual check
3     Residuals vs. fitted values    Linearity & constant variance (homoskedasticity)
4     Residuals vs. each predictor   No missed non-linear relationships
5     ACF of residuals               Independence — no serial correlation

If any check fails: the associated inference (t-tests, F-test, CIs) is unreliable. Do not report p-values from a model with violated assumptions.

Residual check: histogram → QQ → scatter → ACF

What to look for in each plot:

Histogram of residuals:
  • ✓ Roughly bell-shaped, centred at 0
  • ✗ Heavy tails or strong skewness → transform \(Y\) or use robust regression

QQ plot of residuals:
  • ✓ Points on the diagonal
  • ✗ S-curve (both tails deviate) → leptokurtic (heavy-tailed) residuals

Residuals vs. fitted:
  • ✓ Random cloud, no pattern, constant spread
  • ✗ U-shape → missed non-linearity (add a polynomial term)
  • ✗ Fan shape (spread increases) → heteroskedasticity (try log \(Y\))

ACF of residuals:
  • ✓ All lags within the Bartlett 95% bands (\(\pm 2/\sqrt{n}\))
  • ✗ Spike at lag 1, then decay → AR(1) autocorrelation → standard errors are wrong
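The ACF check can be run numerically as well as visually. A sketch with simulated AR(1) "residuals" — i.e., what a failed independence check looks like — using the \(\pm 2/\sqrt{n}\) Bartlett bands from above:

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations rho_0 .. rho_nlags."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.sum(x**2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

n = 500
rng = np.random.default_rng(7)
# AR(1) residuals: e_t = 0.6 * e_{t-1} + noise
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal()

rho = acf(e, nlags=10)
band = 2 / np.sqrt(n)
flagged = [k for k in range(1, 11) if abs(rho[k]) > band]
print(flagged)  # lag 1 (and several after it) exceed the band -> serial correlation
```

For genuinely independent residuals, roughly 1 lag in 20 would be expected to poke outside the band by chance; a decaying run of spikes starting at lag 1 is the AR(1) signature.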

Time series EDA: what to look for before modeling

Before fitting any time series model, describe the raw series:

  1. Trend: is the level rising, falling, or flat over time?
  2. Seasonality: regular periodic cycles (annual, quarterly, weekly)?
  3. Irregular / residual: unexplained variation after removing trend and seasonality?

ACF (Autocorrelation Function): \(\rho_k = \text{Corr}(Y_t, Y_{t-k})\) — how correlated is a series with its own lag-\(k\) past.

PACF (Partial ACF): \(\phi_{kk}\) — correlation at lag \(k\) after controlling for lags 1 through \(k-1\).

Model identification rules (brief):

  • AR(\(p\)): ACF decays gradually, PACF cuts off after lag \(p\)
  • MA(\(q\)): ACF cuts off after lag \(q\), PACF decays gradually
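The PACF can be computed from the ACF by the Durbin–Levinson recursion. A sketch that reproduces the AR identification rule: for an AR(1) with coefficient \(\phi = 0.6\), the theoretical ACF is \(\rho_k = 0.6^k\) (decaying), and the PACF should cut off after lag 1:

```python
import numpy as np

def pacf_from_acf(rho):
    """Partial autocorrelations phi_kk via the Durbin-Levinson recursion."""
    K = len(rho) - 1
    phi = np.zeros((K + 1, K + 1))
    pac = np.zeros(K + 1)
    pac[0] = 1.0
    for k in range(1, K + 1):
        num = rho[k] - sum(phi[k - 1, j] * rho[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1, j] * rho[j] for j in range(1, k))
        phi[k, k] = num / den
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        pac[k] = phi[k, k]
    return pac

# Theoretical AR(1) ACF with phi = 0.6
rho = 0.6 ** np.arange(5)
print(pacf_from_acf(rho))  # approx [1.0, 0.6, 0.0, 0.0, 0.0]: cuts off after lag 1
```

On real data you would feed in sample autocorrelations instead of the theoretical ones, so the "zero" lags are only approximately zero and are judged against the \(\pm 2/\sqrt{n}\) bands.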

Time series + ACF/PACF app

Task — load airline passenger data

  1. Plot the raw time series. Describe trend and seasonality in words.
  2. Inspect the ACF plot. Which lags are significant? What do spikes at lags 12, 24, 36 indicate?
  3. Inspect the PACF. After which lag does it cut off? What AR order does this suggest?
  4. Bridge to residuals: “If my model’s residuals had this ACF pattern, what would that mean?”
  5. (Debrief) What transformation would you apply first before modeling? Why?

Selection guide walkthrough (live demo)

The handbook’s interactive selection guide maps your data situation to the appropriate method.

Live demo — use the constraint picker to answer:

  • “I have two continuous variables and want to measure their association.” → Pearson (if linear, normal) or Spearman (if non-linear or skewed)
  • “I want to describe inequality in a distribution.” → Gini coefficient + Lorenz curve
  • “I want to check whether my regression residuals are autocorrelated.” → ACF plot of residuals

Rule: If you are unsure which method to use — consult the selection guide before choosing.

Key ideas — selection table (L2)

Question                                   Tool                                   What you read
Two continuous variables — association?    Scatterplot first, then Pearson \(r\)  Direction, form, strength, outliers; \(r^2\) = variance explained
Non-linear or ranked data — association?   Spearman \(\rho\) or Kendall \(\tau\)  Monotone association; \(\tau\) has a probabilistic interpretation
Inequality in a distribution?              Gini coefficient + Lorenz curve        0 = equality, 1 = maximum concentration; area under Lorenz
Market concentration?                      Herfindahl-Hirschman Index (HHI)       Closer to 10,000 → more concentrated
Smooth distributional shape?               KDE (bandwidth selection)              Peaks, modes, tails
Are residuals Normal?                      QQ plot + histogram of residuals       Points on the diagonal; bell-shaped histogram
Are residuals independent?                 ACF of residuals                       No significant spikes beyond Bartlett bands
Is there serial structure in raw data?     ACF + PACF                             Trend, seasonality, AR/MA order

Exit problem (pairs, 5 min)

A logistics analyst models delivery time (hours) as a function of distance and package weight using linear regression. She inspects the diagnostic plots:

  • Plot A: residuals vs. fitted values — shows a clear U-shape
  • Plot B: QQ plot of residuals — points deviate from the diagonal in both tails
  • Plot C: ACF of residuals — spike at lag 1 (\(\rho_1 \approx 0.6\)) and decreasing spikes thereafter
  1. Which assumption does Plot A violate? What should she do?
  2. Which assumption does Plot B suggest may be violated?
  3. What does Plot C tell her, and why does it matter for inference?
  4. She also finds Gini = 0.72 for delivery times across routes. What does this say?

Use the selection table on the previous slide.