Table of contents

  • 94.1 Descriptive Analysis without time dimension
    • 94.1.1 Task 1
    • 94.1.2 Task 2
    • 94.1.3 Task 3
    • 94.1.4 Task 4
    • 94.1.5 Task 5
    • 94.1.6 Task 6
    • 94.1.7 Task 7
    • 94.1.8 Task 8
    • 94.1.9 Task 9
    • 94.1.10 Task 10
    • 94.1.11 Task 11
    • 94.1.12 Task 12
    • 94.1.13 Task 13
    • 94.1.14 Task 14
    • 94.1.15 Task 15
    • 94.1.16 Task 16
    • 94.1.17 Task 17
    • 94.1.18 Task 18
    • 94.1.19 Task 19
  • 94.2 Time Series
    • 94.2.1 Task 20
    • 94.2.2 Task 21
    • 94.2.3 Task 22
    • 94.2.4 Task 23
    • 94.2.5 Task 24
    • 94.2.6 Task 25

94  Problems

94.1 Descriptive Analysis without time dimension

94.1.1 Task 1

  • Problem
  • Computation
  • Conclusion

Compute a Frequency Plot as described in Section 55.8.

[Interactive Shiny app]

Most students use the Windows operating system (Win NT6.0, Win NT5.1, and Win NT6.1). Few students use the MacOSX operating system and only one student uses GNU/Linux.

94.1.2 Task 2

  • Problem
  • Computation
  • Conclusion

Compute the Stem-and-Leaf Plot as described in Section 61.8.

[Interactive Shiny app]

The Stem-and-Leaf Plot shows a distribution which is extremely skewed to the right. Hence, it is difficult to determine the mean of the data series. Hint: move the trimming slider to the right to obtain a better image (you can also determine the “trimmed” mean).

94.1.3 Task 3

  • Problem
  • Computation 1
  • Computation 2
  • Conclusion 1
  • Conclusion 2

Recompute the Histogram and answer both questions of Section 62.14.

[Interactive Shiny apps]

The Histogram with 50 bins shows much more detail. The associated Frequency Table can be used to determine several types of Central Tendency. The Mode is located in the bin [200, 300[ because it has the highest absolute frequency. The Median is between 200 and 300 seconds because the cumulative relative frequency for bin [200, 300[ is 67.6% (the bin to the left only has a cumulative percentage of 28.8%).

The second Histogram in Section 62.13 is much easier to interpret because it makes sure that the Likert scores are defined as the center of each bin. The Histogram with “Unknown” scale and only 6 bins is misleading because all observations lie on the boundaries of the bins (e.g. the first bin contains all observations with values 1 and 2).

94.1.4 Task 4

  • Problem
  • Computation
  • Conclusion

Compute the 95% interval with the Harrell-Davis method as described in Section 64.15.

[Interactive Shiny app]

The 95% interval is [12.38, 26.68]. The step size was set to 0.005 because this allows us to use the exact values for [\(Quantile(0.025)\), \(Quantile(0.975)\)], i.e. without the need to compute an interpolation between two adjacent values.

94.1.5 Task 5

  • Problem
  • Computation
  • Conclusion

Determine the “best” measure of Central Tendency as described in Section 65.21.

[Interactive Shiny app]

The Arithmetic Mean is probably not a good choice when we wish to make a prediction for the time needed to submit the survey. The reason is that there are several outliers in the data set which heavily influence the Arithmetic Mean. The Figures of the Trimmed Mean and Winsorized Mean are both decreasing, implying that the Arithmetic Mean would be much lower if extreme values are systematically eliminated. Both Figures converge towards the Median which is a robust measure of Central Tendency. From the Histogram and Stem-and-Leaf Plot we also know that the data set does not have a Uniform Distribution, hence the Midrange is also not an appropriate choice. Furthermore, the Geometric Mean and Harmonic Mean are both not appropriate because we are not dealing with growth rates or output/input ratios.

Taking all these reasons into account, and making the assumption that we wish to make a robust prediction, the best estimate is provided by the Median (= 241.171).
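
The effect described above can be reproduced outside the RFC modules (which run in R). The following Python sketch uses made-up submission times; the data and the 10% trimming fraction are illustrative assumptions.

```python
import statistics

# Hypothetical submission times in seconds; the two largest values mimic
# students who paused the survey for a long time.
times = [180, 195, 210, 225, 230, 241, 250, 265, 280, 310, 900, 2400]

def trimmed_mean(x, p):
    """Drop a fraction p of the observations in each tail, then average."""
    x = sorted(x)
    k = int(len(x) * p)
    return statistics.mean(x[k:len(x) - k])

def winsorized_mean(x, p):
    """Replace a fraction p in each tail by the nearest retained value."""
    x = sorted(x)
    k = int(len(x) * p)
    x = [x[k]] * k + x[k:len(x) - k] + [x[-k - 1]] * k
    return statistics.mean(x)

print(statistics.mean(times))       # pulled upwards by the outliers
print(trimmed_mean(times, 0.10))    # moves towards the bulk of the data
print(winsorized_mean(times, 0.10))
print(statistics.median(times))     # robust reference point
```

As in the Shiny app, the trimmed and winsorized means fall between the outlier-driven Arithmetic Mean and the robust Median.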

94.1.6 Task 6

  • Problem
  • Computation
  • Conclusion

Use the Skewness and Kurtosis Tests in the Computation tab to analyse the Birth Weight dataset. Do you think that the data has a Normal Distribution?

[Interactive Shiny app]

The D’Agostino Skewness statistic is equal to -0.20698, which seems sufficiently close to zero to conclude that the distribution is fairly symmetric. In addition, the Kurtosis test statistic is 2.88678, which is very close to 3, the Kurtosis of a Normal Distribution. At first sight, both results are in line with those of the Normal Distribution.

Important note: we are not considering the fact that there is a probability of making the wrong conclusion. After all, the dataset only contains weights of a limited number of infants. This problem is the subject of Hypothesis Testing and will be discussed later.
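
The moment-based statistics behind these tests are easy to compute by hand. The sketch below (in Python rather than the book's R modules) uses raw sample skewness and kurtosis on synthetic, Normally distributed "birth weights"; the D’Agostino tests apply further transformations that are omitted here.

```python
import random
import statistics

def central_moment(x, k):
    m = statistics.mean(x)
    return sum((v - m) ** k for v in x) / len(x)

def skewness(x):
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    # "Raw" kurtosis: equal to 3 for a Normal Distribution.
    return central_moment(x, 4) / central_moment(x, 2) ** 2

# Synthetic stand-in for the Birth Weight data (kilograms).
random.seed(42)
weights = [random.gauss(3.3, 0.5) for _ in range(5000)]
print(round(skewness(weights), 3))   # near 0 for a symmetric sample
print(round(kurtosis(weights), 3))   # near 3 for a Normal sample
```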

94.1.7 Task 7

  • Problem
  • Conclusion

Use the Skewness-Kurtosis Plot from Section 67.18 and compare the markers with the theory described in Probability Distributions.

The following markers can be observed in the Figure:

  • Triangle: represents the Uniform Distribution with Skewness = 0 (Section 19.8) and Kurtosis = 9/5 (Section 19.9).
  • Plus: represents the Logistic Distribution with Skewness = 0 and Kurtosis = 4.2.
  • Star: represents the Normal Distribution with Skewness = 0 (Section 20.16) and Kurtosis = 3 (Section 20.17).
  • X in Box: represents the Exponential Distribution with Skewness = 2 and Kurtosis = 9.
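
The theoretical Kurtosis values above can be double-checked numerically. The Python sketch below integrates the central moments of the Uniform and Logistic densities with a simple midpoint rule (the grid size and integration range are arbitrary choices for the illustration).

```python
import math

def kurtosis_of_density(pdf, lo, hi, n=100000):
    """Midpoint-rule integration of the central moments of a density on [lo, hi]."""
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]
    ws = [pdf(x) * h for x in xs]
    mean = sum(x * w for x, w in zip(xs, ws))
    m2 = sum((x - mean) ** 2 * w for x, w in zip(xs, ws))
    m4 = sum((x - mean) ** 4 * w for x, w in zip(xs, ws))
    return m4 / m2 ** 2

uniform = kurtosis_of_density(lambda x: 1.0, 0.0, 1.0)
logistic = kurtosis_of_density(
    lambda x: math.exp(-x) / (1 + math.exp(-x)) ** 2, -40.0, 40.0)
print(round(uniform, 3))   # 9/5 = 1.8
print(round(logistic, 3))  # 4.2
```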

94.1.8 Task 8

  • Problem
  • Computation
  • Conclusion

Examine the effect of age on IM.Know as described in Section 69.7.

[Interactive Shiny app]

To do this you must set the “arrangement of groups” to “long format” and select both variables IM.Know and age. Note: the quantitative variable is always assumed to be in the first position.

94.1.9 Task 9

  • Problem
  • Computation
  • Conclusion

Examine the Scatterplot for the two discrete variables as described in Section 70.8 and explain the problems that you see.

[Interactive Shiny app]

Scatterplots are not well-suited to examine discrete variables because multiple points are on exactly the same position. This implies that it is impossible to determine which areas of the Scatter Plot have the highest density of points. The Scatterplot produced by RFC, however, also displays Histograms for both variables which provide (at least) some indication of where most points are located.

As an alternative one could add a very small random number to each observation (this is called “jitter”). A “jittered” Scatter Plot would show where most points are located because the coordinates are slightly randomized. In RFC we don’t use the jittering because there is a better solution which is discussed at a later stage (i.e. the Bivariate Kernel Density Plot).
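
Jittering itself is a one-line idea; here is a minimal Python sketch with made-up Likert-type answers (the jitter amount of 0.15 is an arbitrary assumption).

```python
import random

def jitter(values, amount=0.15):
    """Add small uniform noise so that tied discrete points separate visually."""
    return [v + random.uniform(-amount, amount) for v in values]

# Hypothetical Likert-type answers with many tied (x, y) pairs.
x = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
y = [2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
random.seed(1)
xj, yj = jitter(x), jitter(y)
print(len(set(zip(x, y))), "distinct positions before jittering")
print(len(set(zip(xj, yj))), "distinct positions after jittering")
```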

94.1.10 Task 10

  • Problem
  • Conclusion

Find an example of nonsense correlation and explain why it is spurious (this website might be helpful: https://tylervigen.com/spurious-correlations).

A common source of a spurious correlation between \(X\) and \(Y\) is found when a third (unobserved) variable \(Z\) has an impact on \(X\) and \(Y\). It is possible to remove the effect of \(Z\) by computing a Partial Correlation – obviously, this method can only be used when \(Z\) is actually observed.

94.1.11 Task 11

  • Problem
  • Conclusion

Discuss the question in Section 73.6.

The variable Learning is most closely associated with software competence (the Pearson Correlation is 0.62 which indicates a strong linear relationship). If we wish to control for the (confounding) effects of Happiness, Sport1, and Depression, we need to examine the Partial Pearson Correlations as well. In this case, the conclusion remains exactly the same because the Partial Correlation (between Software and Learning) is 0.58 which is very close to the (ordinary) Pearson Correlation and which indicates that the control variables (i.e. Happiness, Sport1, and Depression) do not influence the measured relationship between Learning and Software.

Does this imply that learning confidence and software competence are truly related to each other? No, it does not, because it is still possible that there are other (yet unobserved) variables which might have an obfuscating effect. On the other hand, the results from the Partial Pearson Correlation matrix increase our trust in the proposition that learning confidence is truly related to software competence.
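
For a single control variable the Partial Correlation has a closed form, \(r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}\). The Python sketch below applies this first-order formula to synthetic data in which a confounder \(z\) drives both variables (controlling for several variables at once, as RFC does, requires the full correlation matrix instead).

```python
import math
import random
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def partial_corr(x, y, z):
    """First-order Partial Correlation of x and y, controlling for one z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical confounder z that drives both x and y.
random.seed(7)
z = [random.gauss(0, 1) for _ in range(2000)]
x = [v + random.gauss(0, 0.5) for v in z]
y = [v + random.gauss(0, 0.5) for v in z]
print(round(pearson(x, y), 2))          # strong: both inherit z
print(round(partial_corr(x, y, z), 2))  # near zero once z is controlled for
```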

94.1.12 Task 12

  • Problem
  • Computation
  • Conclusion

Use the Computation tab to investigate the time needed by students to submit a short survey (in seconds) based on the QQ Plot for the Normal Distribution. We know that this series contains extreme values (some students paused the survey for a long time) – therefore, we want to investigate the effect of trimming on the distribution. Does trimming cause the data to behave like a Normal Distribution?

[Interactive Shiny app]

The analysis shows that the data is not Normally Distributed (even when a maximum of 10% trimming is applied to both sides of the distribution). On the other hand, there is a noticeable improvement when the trimming slider is moved to the right.
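
The straightness of a Normal QQ Plot can be summarized by the correlation between the sorted data and the theoretical Normal quantiles. The Python sketch below applies this to a synthetic right-skewed series (a lognormal stand-in for the submission times) before and after 10% trimming.

```python
import math
import random
import statistics
from statistics import NormalDist

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def qq_correlation(data):
    """Correlation between the sorted data and theoretical Normal quantiles."""
    s = sorted(data)
    n = len(s)
    q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(s, q)

def trimmed(data, p):
    s = sorted(data)
    k = int(len(s) * p)
    return s[k:len(s) - k]

# Right-skewed synthetic stand-in for the submission times.
random.seed(3)
skewed = [math.exp(random.gauss(0, 1)) for _ in range(1000)]
print(round(qq_correlation(skewed), 3))                 # poor fit
print(round(qq_correlation(trimmed(skewed, 0.10)), 3))  # better, still imperfect
```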

94.1.13 Task 13

  • Problem
  • Computation
  • Conclusion

Use the Tukey-Lambda PPCC Plot to examine the Divorces time series as described in Section 78.8.

[Interactive Shiny app]

The table of the Tukey-Lambda PPCC Plot shows that the highest Pearson Correlation is reached for \(\lambda = 0.14\) which corresponds to the Normal Distribution. This means that the Normal Distribution is a better fit for the Divorces time series than the other distributions that are listed (i.e. the Cauchy, Logistic, U-shaped, and Uniform Distributions).

Note: this procedure only works for symmetric distributions, so we are assuming that the Divorces have zero Skewness!
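
The idea behind the PPCC Plot can be sketched directly: for each \(\lambda\) on a grid, correlate the sorted data with Tukey-Lambda quantiles and keep the \(\lambda\) with the highest correlation. The Python sketch below uses a synthetic Normal sample; the grid bounds are arbitrary choices.

```python
import math
import random
import statistics

def tl_quantile(p, lam):
    """Tukey-Lambda quantile function; lambda = 0 reduces to the Logistic."""
    if abs(lam) < 1e-12:
        return math.log(p / (1 - p))
    return (p ** lam - (1 - p) ** lam) / lam

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ppcc(data, lam):
    s = sorted(data)
    n = len(s)
    q = [tl_quantile((i + 0.5) / n, lam) for i in range(n)]
    return pearson(s, q)

random.seed(5)
sample = [random.gauss(0, 1) for _ in range(800)]
grid = [l / 100 for l in range(-25, 101, 5)]
best = max(grid, key=lambda l: ppcc(sample, l))
print(best)  # for Normal data the maximising lambda is typically near 0.14
```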

94.1.14 Task 14

  • Problem
  • Computation 1
  • Computation 2
  • Conclusion

Compare the Marriages and Divorces time series based on the Kernel Density Plots (Section 80.13).

[Interactive Shiny apps]

Marriages

The Kernel Density Plot shows a bimodal distribution, which means that there are (many) months with a relatively low level and (many) other months with a relatively high level of marriages. The reason why this is the case will be discussed in Chapter 88. For now it is sufficient to think of the Marriages time series in terms of “popular months” and “unpopular months”, causing the distribution to have a bimodal shape. Perhaps there are months which are popular because of the expected weather?

Divorces

The Kernel Density Plot shows a unimodal distribution which means that in most months the number of divorces is roughly equal. Should there be any reason why couples would want to divorce in a specific month of the year?

Final Conclusion

Couples might have reasons to choose specific months of the year to get married which leads to a bimodal distribution. For couples who want to divorce there might be no or little incentive to choose a specific month of the year.
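
The bimodal-versus-unimodal contrast can be imitated with a small Gaussian KDE. The Python sketch below uses synthetic "popular/unpopular month" data; the levels and the bandwidth of 25 are illustrative assumptions, not the actual Marriages and Divorces figures.

```python
import math
import random

def gaussian_kde(data, h):
    """Return the Gaussian-kernel density estimate with bandwidth h."""
    n = len(data)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in data)

def count_modes(f, lo, hi, steps=200):
    ys = [f(lo + i * (hi - lo) / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])

# Synthetic stand-ins: two "popular/unpopular" levels vs one common level.
random.seed(11)
marriages = ([random.gauss(300, 20) for _ in range(60)] +
             [random.gauss(500, 20) for _ in range(60)])
divorces = [random.gauss(250, 25) for _ in range(120)]
print(count_modes(gaussian_kde(marriages, 25), 200, 600))  # bimodal
print(count_modes(gaussian_kde(divorces, 25), 150, 350))   # unimodal
```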

94.1.15 Task 15

  • Problem
  • Computation
  • Conclusion

Examine the data shown in the Computation tab and describe what you see in the Bivariate Kernel Density Plot. Also have a look at the 3D plot (select “persp” in the “Type of plot” dropdown menu).

[Interactive Shiny app]

The plot shows that there are only six areas (clusters) where points are present. Only three of them have a higher density, containing many points. The area with the highest density is located at the top left section of the graph. The Bivariate Kernel Density conveys more information than an ordinary Scatter Plot because there is a third dimension (i.e. the density of points) which is represented by drawing contours of equal density and using a color scheme which indicates the “height” of the density.

94.1.16 Task 16

  • Problem
  • Computation
  • Conclusion

Interpret the Bivariate Kernel Density Plot shown in the Computation tab.

[Interactive Shiny app]

The plot suggests a non-linear relationship between the variables “dis” and “nox” and indicates that the observation pairs are clustered in several areas. Each cluster exhibits a negative, linear relationship between the variables. The clusters themselves, however, are arranged in a non-linear manner. The plot does not explain why these patterns emerge, it only helps us detect them.

94.1.17 Task 17

  • Problem
  • Histogram
  • Computation 1
  • Computation 2
  • Computation 3
  • Conclusion

Examine the students’ Numeracy Scores (collected in the past) and use the Bootstrap Plot to predict the scores of the current academic year. The data can be found in the Histogram tab (you can copy the data and use them in any other R module).

[Interactive Shiny apps]

Before we compute the Bootstrap Plot, it is useful to investigate the distribution of numeracy scores. The results shown in the Computation 1 tab clearly show that:

  • The data are not from a Uniform Distribution which allows us to conclude that the Midrange is not an adequate measure of Central Tendency.
  • The data are not from a Normal Distribution because of deviations in the left tail. This implies that there are students with extremely low numeracy scores which might have a biasing effect on the Arithmetic Mean.

In order to see more details about the distribution it is possible to use the Kernel Density Plot as shown in the Computation 2 tab. The Gaussian Kernel shows that the distribution of numeracy scores is skewed to the left. In addition, there are two modes in the neighborhood of a numeracy score of 20, which might be explained by the fact that the student group is heterogeneous1.

The Bootstrap Plot (as shown in Computation 3) computes five measures of Central Tendency: Arithmetic Mean, Median, Midrange, Harmonic Mean, and Geometric Mean. In this case, three of these measures can be discarded: the Midrange (because the distribution is not uniform) and the Harmonic and Geometric means (because they have a huge variability).

The remaining two measures of Central Tendency could be used to make predictions but they each have different properties and can be used for specific purposes:

  • The Arithmetic Mean attributes an equal weight to the numeracy level of each student, including those with extremely low scores. The Kernel Density Plot (of simulated Arithmetic Means) looks like a Normal Distribution, and the associated Notched Box Plot and bootstrap table show that the Variability of the Arithmetic Mean, as measured by the Standard Deviation and the Interquartile Range, is very small (producing a small 50% interval of [19.783, 20.205] around the estimate of 19.996). The 95% interval is [19.331, 20.54], which is symmetric around the estimated value of 19.996.

  • The Median discards all extremes and predicts the numeracy level based on the student for whom 50% of peers perform better and the other 50% perform worse. The Median has an odd-looking, multimodal distribution, as can be seen in the Kernel Density Plot, and predicts the numeracy to be 20 with a 50% interval of [20, 21]. The 95% interval is exactly the same, i.e. [20, 21], which lies asymmetrically around the estimate of 20.

So if we wish to make a probabilistic prediction for a randomly selected student, we need to make additional assumptions which lead to different answers:

  • The 95% confidence interval for the population mean numeracy score (including all students that have enrolled in the statistics course) is [19.331, 20.54].

  • The 95% confidence interval for the trimmed-population mean numeracy score (excluding students with very low or very high scores) is [20, 21].

Conclusion: the right answer does not only depend on which measure of Central Tendency has the smallest Variability, it mainly depends on which question we wish to answer (i.e. which additional assumptions we want to make regarding the type of student that should be considered). Furthermore, the answer would be completely different if we were to split the dataset into homogeneous subgroups (e.g. females and males).

A final note: the trimming percentages can have a big impact on the results. Move the trimming slider to observe how sensitive the results are.
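
The mechanics of the Bootstrap Plot reduce to resampling with replacement and reading off percentiles. Here is a Python sketch with synthetic numeracy-like scores; the scores, seeds, and replication count are illustrative assumptions.

```python
import random
import statistics

def bootstrap_interval(data, stat, level=0.95, reps=2000, seed=0):
    """Percentile Bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    sims = sorted(stat([rng.choice(data) for _ in data]) for _ in range(reps))
    lo = sims[int(reps * (1 - level) / 2)]
    hi = sims[int(reps * (1 + level) / 2) - 1]
    return lo, hi

# Synthetic numeracy scores centred near 20 with clipped tails.
random.seed(4)
scores = [max(2, min(30, round(random.gauss(20, 4)))) for _ in range(150)]
print(bootstrap_interval(scores, statistics.mean))    # smooth interval
print(bootstrap_interval(scores, statistics.median))  # coarse, often asymmetric
```

Because the Median of integer-valued scores can only take a few distinct values, its bootstrap interval is typically coarse and asymmetric, just as observed in the Computation 3 tab.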

94.1.18 Task 18

  • Problem
  • Computation
  • Conclusion

Compute the SSROC analysis (Chapter 84) for the following items of the AMS dataset: Q1_5, Q1_12, Q1_19, and Q1_26. These items are used to measure students’ Amotivation (i.e. a lack of motivation to engage in higher education). Do you think we can add these items together to obtain a measure for Amotivation? Are there any items that should be left out?

[Interactive Shiny app]

When we compute the alternative SSROC scores, we obtain measures for each item that have a high rank correlation with the Arithmetic Mean (\(\tau \simeq 0.91\)). In other words, there is evidence to suggest that the Arithmetic Mean (or the simple sum) of all items would preserve the rank order of the obtained measurements.

The analysis also shows that the Cronbach \(\alpha\) (for all items) is 0.8878. This value cannot be increased by eliminating any of the four items.
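
The Cronbach \(\alpha\) statistic itself is a one-formula computation, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)\), where \(\sigma_i^2\) are the item variances and \(\sigma_T^2\) is the variance of the total score. The Python sketch below uses made-up Likert answers, not the actual AMS data; the item names are borrowed from the task for readability only.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item (same respondents)."""
    k = len(items)
    totals = [sum(answers) for answers in zip(*items)]
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Made-up Likert answers (1-7) of eight respondents to four items.
q1_5  = [1, 2, 2, 3, 5, 6, 6, 7]
q1_12 = [2, 2, 3, 3, 5, 5, 6, 7]
q1_19 = [1, 1, 2, 4, 4, 6, 7, 7]
q1_26 = [2, 3, 2, 3, 6, 5, 7, 6]
print(round(cronbach_alpha([q1_5, q1_12, q1_19, q1_26]), 3))
```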

94.1.19 Task 19

  • Problem
  • Computation
  • Conclusion

Have another look at the AMS dataset and show that the item Q1_2 was used to compute IM.Know (i.e. it is one of the items that was added to construct IM.Know) and item Q1_1 was not. Use Notched Boxplots and Kendall’s \(\tau\) rank correlations.

[Interactive Shiny app]

The results in the Computation tab show the boxplots of IM.Know (as the quantitative variable) versus Q1_1 as the categorical variable (note that the boxplots were constructed with the “long format” setting). All the boxplots seem to be at the same level, hence there’s no reason to believe that Q1_1 contributes to the measurement of IM.Know.

Now change the categorical variable Q1_1 into Q1_2 and observe how the boxplots now show an increasing pattern. In other words, higher Q1_2 answers correspond to higher IM.Know answers. The same phenomenon can be observed when using the Correlations tab of the R module.

94.2 Time Series

94.2.1 Task 20

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 88.6.

[Interactive Shiny app]

The Notched Box Plots (for periodic subseries) provide much more useful information about the seasonal pattern because the Median number of Marriages during the period May-September is clearly substantially higher than in other months (the Notches do not overlap, indicating that the pattern is not due to chance). The same conclusion can be drawn from the differenced periodic subseries because there are many Boxplot pairs with non-overlapping Notches (i.e. different Medians).

The Notched Box Plots for sequential blocks (i.e. years) do not show any evidence of a long-run trend. This is also the reason why the previous two plots clearly indicate a seasonal pattern: only when there is a strong, long-run trend in the time series can the seasonal pattern be obfuscated when no differencing is applied (see the example of the Airline data in Section 88.5).

Conclusion: the Marriages time series exhibits no long-run trend but only a seasonal pattern.

94.2.2 Task 21

  • Problem
  • Histogram
  • Computation 1
  • Computation 2
  • Conclusion

Generate a prediction based on the Blocked Bootstrap Plot for the time series “Rainfall in Nottingham Castle” as is shown in the Histogram tab (copy the data to any other R module).

[Interactive Shiny apps]

The Histogram shows two bins with high absolute frequency, i.e. 14 cases and 13 cases. The shape of the Histogram depends on the number of bins that is chosen – hence, the bimodality of the Rainfall series is only detected if an appropriate choice is made. The Histogram in the bookmarked computation might have just enough bins to detect the bimodal nature of the time series (try to recompute the Histogram with more bins).

The Gaussian Kernel Density Plot in Computation 1 clearly shows a bimodal distribution of Rainfall. The shape of this plot also depends on a parameter (i.e. the so-called “bandwidth” parameter). However, the software is often able to choose an appropriate default value which allows one to detect the interesting features of the underlying distribution.

Clearly the Harmonic Mean provides the estimate with the smallest Standard Deviation and Interquartile Range. The Arithmetic Mean and Geometric Mean are very close (in terms of variability) but if we prefer the predictor with the highest confidence, the Harmonic Mean seems to win.

The Blocked Bootstrap Plot provides adequate and useful information about the empirical distribution but it requires the user to have some practical experience (because sometimes the simulations can fail). When the confidence intervals are symmetrically distributed around the estimate, the (Blocked) Bootstrap Plot provides very useful information. When the intervals are not symmetric, one should be wary of the possibility that the interpretation could be problematic, especially if the data under investigation is heterogeneous. When the estimate falls outside the 50% interval one should not use the bootstrap results.

Finally, note that the method employed here involves simulations to derive intervals. This is a probabilistic process which implies that every computation will yield different results. The discrepancies between computations, however, will become reasonably small when a sufficient number of simulations is used.
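
The blocked variant differs from the ordinary bootstrap only in that contiguous blocks, rather than single observations, are resampled. Here is a Python sketch on a synthetic autocorrelated series; the AR(1) parameters, block length, and seeds are illustrative assumptions.

```python
import random
import statistics

def blocked_bootstrap_means(series, block_len, reps=1000, seed=0):
    """Resample contiguous blocks so short-run dependence survives resampling."""
    rng = random.Random(seed)
    n = len(series)
    starts = range(n - block_len + 1)
    means = []
    for _ in range(reps):
        resample = []
        while len(resample) < n:
            s = rng.choice(starts)
            resample.extend(series[s:s + block_len])
        means.append(statistics.mean(resample[:n]))
    return sorted(means)

# Synthetic autocorrelated series (AR(1) around a level of 60).
random.seed(9)
series, prev = [], 60.0
for _ in range(240):
    prev = 0.7 * prev + 0.3 * 60 + random.gauss(0, 8)
    series.append(prev)
sims = blocked_bootstrap_means(series, block_len=12)
print(round(sims[25], 2), round(sims[974], 2))  # approximate 95% interval
```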

94.2.3 Task 22

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 90.6.

[Interactive Shiny app]

The Standard Deviation-Mean Plot provides evidence that the Standard Deviations of subsequent years can be explained by the corresponding Mean of the same year. This is a typical pattern which is often encountered in biology and economics. It implies that the (annual) Variability of the time series is not stable over time. In practice, we will have to apply some sort of transformation in order to stabilize the Variance (or Standard Deviation).
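
The data behind such a plot are simply the per-year means and standard deviations. The Python sketch below builds a synthetic multiplicative series (growth rate and seasonal factors are made up) in which the sd/mean ratio is constant, the classic signal for a logarithmic transformation.

```python
import statistics

def sd_mean_pairs(series, period=12):
    """Pair each year's Mean with its Standard Deviation (monthly data)."""
    pairs = []
    for i in range(0, len(series) - period + 1, period):
        block = series[i:i + period]
        pairs.append((statistics.mean(block), statistics.stdev(block)))
    return pairs

# Synthetic multiplicative series: variability grows with the level.
series = []
for year in range(10):
    level = 100 * 1.2 ** year
    series += [level * (1 + 0.1 * (m % 4 - 1.5)) for m in range(12)]
for mean, sd in sd_mean_pairs(series):
    print(round(mean, 1), round(sd, 1), round(sd / mean, 3))  # constant ratio
```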

94.2.4 Task 23

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 91.8.

[Interactive Shiny app]

The Variance Reduction Matrix demonstrates that only seasonal differencing (i.e. \(d = 0\) and \(D = 1\)) is required to make the Variance as small as possible. This implies that the time series contains a strong seasonal pattern which can be removed through seasonal differencing. Exactly the same conclusion is obtained when we use the Range or the Trimmed Variance.
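
A Variance Reduction Matrix is just the variance of the series after each \((d, D)\) differencing combination. The Python sketch below builds a synthetic seasonal series without a long-run trend (the sine amplitude and pseudo-noise are arbitrary) and confirms that seasonal differencing alone minimises the variance.

```python
import math
import statistics

def diff(x, lag=1):
    return [b - a for a, b in zip(x, x[lag:])]

def variance_reduction_matrix(series, period=12):
    """Variance of the series after each combination of d and D differencing."""
    table = {}
    for d in (0, 1, 2):
        for D in (0, 1):
            x = series
            for _ in range(D):
                x = diff(x, period)
            for _ in range(d):
                x = diff(x, 1)
            table[(d, D)] = statistics.variance(x)
    return table

# Synthetic seasonal series without long-run trend.
series = [100 + 40 * math.sin(2 * math.pi * t / 12) + (t * 7 % 5 - 2)
          for t in range(120)]
table = variance_reduction_matrix(series)
print(min(table, key=table.get))  # (0, 1): seasonal differencing is enough
```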

94.2.5 Task 24

  • Problem
  • Computation
  • Conclusion

Based on the HPC time series, use autocorrelations to identify the long-run trend and seasonality.

[Interactive Shiny app]

First we compute the ACF for \(d = D = 0\) and observe a slowly decreasing pattern of autocorrelation coefficients. We decide to apply non-seasonal differencing – when the ACF is re-computed with \(d = 1\) we can see a seasonal trend pattern emerge. Therefore, an additional seasonal differencing operation must be applied. The result shown in the Computation tab shows that \(d = D = 1\) allows us to remove the seasonal and non-seasonal trend.
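
The same diagnostic can be sketched on synthetic data: a slowly decaying ACF before differencing, and only short MA-type spikes left after \(d = D = 1\). The trend slope, seasonal amplitude, and noise level below are illustrative assumptions, not the HPC data.

```python
import math
import random
import statistics

def acf(series, max_lag=24):
    n = len(series)
    m = statistics.mean(series)
    c0 = sum((v - m) ** 2 for v in series)
    return [sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def diff(x, lag=1):
    return [b - a for a, b in zip(x, x[lag:])]

# Synthetic trend + seasonal series.
random.seed(8)
series = [0.5 * t + 10 * math.sin(2 * math.pi * t / 12) + random.gauss(0, 2)
          for t in range(144)]
r0 = acf(series)
r1 = acf(diff(diff(series, 12), 1))  # d = D = 1
print(round(r0[1], 2))               # near 1: slowly decaying ACF (trend)
print(round(statistics.mean(abs(v) for v in r1[1:]), 2))  # much smaller overall
```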

94.2.6 Task 25

  • Problem
  • Computation
  • Conclusion

Repeat the previous task but use the Cumulative Periodogram instead.

[Interactive Shiny app]

The Computation tab shows the result for \(d = D = 1\). It is clear that the “big steps” are not present in the cumulative periodogram at this level of differencing. When the differencing sliders are set back to zero, we can see the non-seasonal and seasonal patterns reappear.
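
A "big step" in the cumulative periodogram is simply a single frequency carrying a large share of the total power. The Python sketch below computes a cumulative periodogram via a direct discrete Fourier transform on a synthetic series dominated by a period-12 cycle (amplitudes and noise are made up).

```python
import cmath
import math

def cumulative_periodogram(series):
    """Normalised cumulative sum of the periodogram ordinates."""
    n = len(series)
    m = sum(series) / n
    x = [v - m for v in series]
    power = []
    for k in range(1, n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    out, cum = [], 0.0
    for p in power:
        cum += p
        out.append(cum / total)
    return out

# Synthetic series dominated by a seasonal cycle of period 12.
series = [10 * math.sin(2 * math.pi * t / 12) + 0.5 * (3 * t % 7 - 3)
          for t in range(120)]
cp = cumulative_periodogram(series)
print(round(cp[9] - cp[8], 2))  # the "big step" at frequency 10/120 (period 12)
```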


  1. In fact we know that this is the case. For instance, there is a substantial difference between the numeracy scores of males and females.


© 2026 Patrick Wessa. Provided as-is, without warranty.
