Table of contents

  • 4.1 Four Purposes of Statistical Analysis
    • 4.1.1 Purpose 1: Description
    • 4.1.2 Purpose 2: Exploration
    • 4.1.3 Purpose 3: Inference
    • 4.1.4 Purpose 4: Prediction
  • 4.2 The Same Data, Different Purposes
  • 4.3 Why Purpose Matters: Two Critical Distinctions
    • 4.3.1 Inference vs. Prediction in Modeling
    • 4.3.2 Description vs. Exploration vs. Residual Analysis
  • 4.4 The Role of Probability
  • 4.5 Time Series: A Special Case
  • 4.6 Navigating This Book
  • 4.7 Before You Choose a Method
  • 4.8 Summary

4  The Big Picture: Why We Analyze Data

Before diving into formulas and methods, let’s step back and ask a simple question: Why do we analyze data in the first place?

The answer to this question determines everything that follows — which methods to use, how to interpret results, and what conclusions we can draw. This chapter provides a roadmap of the entire book by exploring the different reasons people analyze data.

4.1 Four Purposes of Statistical Analysis

When someone collects data and wants to analyze it, they typically have one of four goals in mind:

  1. Describe: Summarize what the data looks like
  2. Explore: Discover patterns, problems, or interesting questions
  3. Infer: Test claims or draw conclusions about a larger population
  4. Predict: Forecast future observations or outcomes

These purposes are not mutually exclusive — a complete analysis often involves all four. But understanding which purpose drives your analysis helps you choose the right tools.

4.1.1 Purpose 1: Description

“What does my data look like?”

Sometimes we simply want to summarize a dataset. A company might want to know the average salary of its employees. A teacher might want to know the typical exam score in their class. A researcher might want to characterize the patients in a clinical trial.

Example: Summarizing Customer Ages

A retail store collects data on 500 customers and wants to describe their age distribution:

# Simulated customer ages
set.seed(42)
ages <- c(rnorm(300, mean = 35, sd = 10), rnorm(200, mean = 55, sd = 8))
ages <- pmax(18, pmin(ages, 80))  # Constrain to realistic range

# Descriptive summary
cat("Number of customers:", length(ages), "\n")
Number of customers: 500 
cat("Average age:", round(mean(ages), 1), "years\n")
Average age: 42.9 years
cat("Median age:", round(median(ages), 1), "years\n")
Median age: 42.7 years
cat("Youngest:", round(min(ages), 1), "years\n")
Youngest: 18 years
cat("Oldest:", round(max(ages), 1), "years\n")
Oldest: 78.7 years
cat("Standard deviation:", round(sd(ages), 1), "years\n")
Standard deviation: 13 years

The goal here is purely descriptive — we are summarizing this specific dataset, not making claims about customers in general or predicting future customer ages.

Descriptive methods covered in this book include measures of central tendency (mean, median, mode), variability (variance, standard deviation), and visualization techniques like histograms (Chapter 62) and box plots (Chapter 69).

4.1.2 Purpose 2: Exploration

“What interesting patterns or problems are hiding in my data?”

Exploratory Data Analysis (EDA) goes beyond description. The goal is to discover things we didn’t know to look for: unexpected patterns, outliers, data quality issues, or new research questions.

Example: Discovering a Bimodal Distribution

Looking at the same customer age data, a simple histogram reveals something interesting:

hist(ages, breaks = 20, col = "steelblue", border = "white",
     main = "Customer Age Distribution",
     xlab = "Age (years)", ylab = "Number of customers")
Figure 4.1: Exploratory histogram reveals two distinct customer groups

The histogram reveals two peaks — the store seems to have two distinct customer groups (younger adults around 35 and older adults around 55). This was not obvious from the summary statistics alone.

This discovery might lead to new questions: Are these groups buying different products? Should marketing strategies differ for each group?

EDA is about asking questions, not answering them. It’s detective work. Tools for exploration include scatter plots, correlation analysis, box plots for comparing groups (Chapter 69), and various diagnostic plots.

4.1.3 Purpose 3: Inference

“Can I draw conclusions about a population based on my sample?”

Inference is about going beyond the data in hand. We use a sample to make claims about a larger population, and we quantify our uncertainty about those claims.

Example: Testing a Claim About Customer Satisfaction

A company claims that at least 80% of customers are satisfied with their service. A consumer organization surveys 200 randomly selected customers and finds that 148 (74%) report being satisfied. Is the company’s claim credible?

# Test the claim
satisfied <- 148
total <- 200
claimed_rate <- 0.80

# One-sample proportion test
result <- prop.test(satisfied, total, p = claimed_rate, alternative = "less")
cat("Observed satisfaction rate:", satisfied/total * 100, "%\n")
Observed satisfaction rate: 74 %
cat("Claimed rate:", claimed_rate * 100, "%\n")
Claimed rate: 80 %
cat("p-value:", round(result$p.value, 4), "\n")
p-value: 0.021 

The p-value tells us how likely we would observe 74% satisfaction (or less) if the true rate really were 80%. A small p-value suggests the company’s claim may not be accurate.

This is inference — we’re using sample data to evaluate a claim about all customers, not just the 200 we surveyed.
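The `prop.test` result above relies on a normal approximation. As a cross-check (a sketch, not part of the original survey analysis), the same question can be answered with the exact binomial distribution:

```r
# Cross-check of the proportion test with the exact binomial distribution
satisfied <- 148
total <- 200
claimed_rate <- 0.80

# Exact probability of observing 148 or fewer satisfied customers
# if the true satisfaction rate really were 80%
exact_p <- pbinom(satisfied, total, claimed_rate)
cat("Exact one-sided p-value:", round(exact_p, 4), "\n")

# binom.test() packages the same computation as a formal test
binom.test(satisfied, total, p = claimed_rate, alternative = "less")$p.value
```

Both the approximate and the exact calculation lead to the same qualitative conclusion: an observed rate of 74% would be unusual if the true rate were 80%.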

Inferential methods include hypothesis tests, confidence intervals, and the various tests covered in the Hypothesis Testing part of this book.

4.1.4 Purpose 4: Prediction

“What will happen next?”

Sometimes we want to forecast future observations or predict outcomes for new cases. The focus shifts from understanding why something happens to predicting what will happen.

Example: Predicting House Prices

A real estate company wants to predict house prices based on features like size, location, and age. They have data on 500 past sales and want to predict prices for new listings.

# Simulated house price data
set.seed(123)
n <- 500
size <- runif(n, 800, 3000)  # Square feet
age <- sample(0:50, n, replace = TRUE)  # Years
price <- 50000 + 100 * size - 500 * age + rnorm(n, 0, 30000)

houses <- data.frame(size, age, price)

# Fit a prediction model
model <- lm(price ~ size + age, data = houses)

# Predict price for a new house: 1800 sq ft, 10 years old
new_house <- data.frame(size = 1800, age = 10)
predicted_price <- predict(model, new_house)
cat("Predicted price for 1800 sq ft, 10-year-old house: $",
    format(round(predicted_price), big.mark = ","), "\n", sep = "")
Predicted price for 1800 sq ft, 10-year-old house: $223,513

Here we don’t care as much about why size and age affect price — we just want accurate predictions for new houses.
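When prediction is the goal, a point forecast is usually reported together with a prediction interval that expresses its uncertainty. A minimal sketch using the simulated model above (the interval itself is illustrative):

```r
# Rebuild the simulated house-price data and model from above
set.seed(123)
n <- 500
size <- runif(n, 800, 3000)
age <- sample(0:50, n, replace = TRUE)
price <- 50000 + 100 * size - 500 * age + rnorm(n, 0, 30000)
model <- lm(price ~ size + age, data = data.frame(size, age, price))

# 95% prediction interval for a single new listing
pi <- predict(model, data.frame(size = 1800, age = 10),
              interval = "prediction", level = 0.95)
pi  # columns: fit (point forecast), lwr and upr (interval bounds)
```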

Prediction methods include regression models (Chapter 134, Chapter 135, Chapter 136), classification trees (Chapter 140), and time series forecasting (Chapter 153).

4.2 The Same Data, Different Purposes

The purpose of your analysis determines what you do with the data. Consider this example:

Scenario: A hospital collects data on 1,000 patients, including their age, weight, blood pressure, and whether they developed heart disease.

Table 4.1: Different purposes lead to different analyses

Purpose  | Question                                                      | Approach
---------|---------------------------------------------------------------|------------------------------------------------------------------------
Describe | What is the average blood pressure in our patient population? | Calculate mean, median, standard deviation
Explore  | Are there unusual patterns or subgroups in the data?          | Create scatter plots, look for clusters, check for outliers
Infer    | Is high blood pressure associated with heart disease?         | Conduct a hypothesis test comparing blood pressure between groups
Predict  | Which patients are likely to develop heart disease?           | Build a classification model using logistic regression or decision trees

4.3 Why Purpose Matters: Two Critical Distinctions

4.3.1 Inference vs. Prediction in Modeling

When building a regression model, your purpose fundamentally changes how you evaluate success:

Inference focus (answering scientific questions):

  • You care about the coefficients: “Does smoking increase disease risk?”
  • You want p-values and confidence intervals
  • Interpretability is essential
  • A simpler model may be better even if it predicts slightly worse

Prediction focus (forecasting new cases):

  • You care about accuracy on new data
  • You evaluate using out-of-sample metrics (test set performance)
  • A complex model is fine if it predicts well
  • The coefficients themselves may be uninterpretable

Example: A pharmaceutical company studies whether a new drug reduces blood pressure.

  • For inference: They need to estimate the drug effect (coefficient) with a confidence interval and test whether it’s statistically significant. The model must be interpretable.
  • For prediction: If they just want to predict a patient’s blood pressure after treatment, they could use any model that predicts well — even a “black box” model.
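The two evaluation styles can be contrasted directly in code. In this sketch (simulated data, illustrative variable names), the inference view inspects a coefficient and its confidence interval, while the prediction view measures error on held-out data:

```r
# Simulated data: y depends linearly on x
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
dat <- data.frame(x, y)

# Hold out 50 observations for out-of-sample evaluation
train <- sample(n, 150)
fit <- lm(y ~ x, data = dat[train, ])

# Inference view: estimate and confidence interval for the x coefficient
confint(fit, "x", level = 0.95)

# Prediction view: root mean squared error on the held-out test set
test_set <- dat[-train, ]
rmse <- sqrt(mean((test_set$y - predict(fit, test_set))^2))
cat("Out-of-sample RMSE:", round(rmse, 2), "\n")
```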

4.3.2 Description vs. Exploration vs. Residual Analysis

Not all “looking at data” is the same:

Description: Summarizing what you observe

  • “The average customer spends €47 per visit”
  • Purpose: Reporting, characterization

Exploration: Discovering the unexpected

  • “There seem to be two distinct customer groups — I didn’t expect that”
  • Purpose: Generating hypotheses, finding patterns

Residual analysis: Checking model assumptions

  • “The residuals from my regression model show a curved pattern — the linear model may not be appropriate”
  • Purpose: Validating models, diagnosing problems

All three use similar tools (plots, summary statistics) but for different reasons. A histogram during exploration helps you discover patterns. The same histogram of residuals checks whether your model assumptions hold.
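A minimal residual-diagnostic sketch (simulated data, for illustration only): fitting a straight line to a truly quadratic relationship leaves a curved pattern in the residuals-vs-fitted plot — exactly the symptom described above.

```r
# Simulate a quadratic relationship, then (deliberately) fit a straight line
set.seed(7)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x^2 + rnorm(100)
fit <- lm(y ~ x)

# Residuals vs fitted values: a curved band signals a misspecified model
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```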

4.4 The Role of Probability

Before we can do inference, we need to understand probability and probability distributions. If you flip a coin 100 times and get 60 heads, is the coin biased? To answer this, you need to know what to expect from a fair coin — and that requires understanding the binomial distribution.

The probability chapters (Chapter 5 through the F-distribution chapter) provide the foundation for all inferential methods. They answer questions like:

  • How much variation should we expect by chance?
  • What does a “typical” sample look like?
  • How do we quantify unusual observations?
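The coin example can be made concrete with a one-line computation (a sketch; the usual 5% benchmark is only a convention):

```r
# Probability of seeing 60 or more heads in 100 flips of a fair coin
p_extreme <- pbinom(59, size = 100, prob = 0.5, lower.tail = FALSE)
cat("P(60 or more heads | fair coin):", round(p_extreme, 4), "\n")
```

The probability is small (roughly 3%), so 60 heads would be fairly unusual for a fair coin — precisely the kind of reasoning the inferential chapters formalize.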

4.5 Time Series: A Special Case

When data are collected over time, the observations are usually not independent — today’s value depends on yesterday’s. This requires specialized methods:

  • Time series plots to visualize temporal patterns
  • Decomposition (Chapter 146) to separate trend, seasonality, and noise
  • ARIMA models (Chapter 148) for forecasting

Time series analysis combines all four purposes: describing the series, exploring for patterns, testing for trends or seasonality, and predicting future values.
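A minimal decomposition sketch using `AirPassengers`, a monthly series built into base R (chosen here purely for illustration; it is not a dataset from this book):

```r
# Classical decomposition of a built-in monthly time series
dec <- decompose(AirPassengers)

# The series is separated into dec$trend, dec$seasonal, and dec$random
plot(dec)  # draws four panels: observed, trend, seasonal, random
```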

4.6 Navigating This Book

The book is organized to match these purposes:

Table 4.2: Book organization by purpose

Part                         | Purpose            | Key Question
-----------------------------|--------------------|----------------------------------------------
Introduction to Probability  | Foundation         | What should we expect by chance?
Probability Distributions    | Foundation         | What patterns does randomness follow?
Descriptive Statistics & EDA | Describe & Explore | What does the data look like?
Hypothesis Testing           | Infer              | Can we generalize from sample to population?
Regression Models            | Infer & Predict    | How are variables related? What will happen?
Time Series Analysis         | All four           | How do patterns unfold over time?

4.7 Before You Choose a Method

When you have data and want to analyze it, ask yourself:

  1. What is my goal? Am I describing, exploring, testing a hypothesis, or predicting?

  2. What am I assuming? Every method makes assumptions. Am I willing to accept them?

  3. What would convince me? Before looking at results, decide what would change your mind.

  4. Who is my audience? A scientific paper requires different evidence than a business report.

The detailed selection guide in Appendix A provides systematic help for choosing specific methods. But first, be clear about your purpose — the “how” follows from the “why.”

4.8 Summary

  • Statistical analysis serves four main purposes: description, exploration, inference, and prediction
  • The same data can be analyzed differently depending on your goal
  • Inference and prediction require different evaluation criteria
  • Description, exploration, and residual analysis use similar tools for different reasons
  • Understanding probability provides the foundation for inference
  • Time series data require specialized methods due to temporal dependence

The rest of this book provides the tools for each purpose. But always start with the question: Why am I analyzing this data?


© 2026 Patrick Wessa. Provided as-is, without warranty.

Cookie Preferences