Table of contents

  • 94.1 Descriptive Analysis without time dimension
    • 94.1.1 Task 1
    • 94.1.2 Task 2
    • 94.1.3 Task 3
    • 94.1.4 Task 4
    • 94.1.5 Task 5
    • 94.1.6 Task 6
    • 94.1.7 Task 7
    • 94.1.8 Task 8
    • 94.1.9 Task 9
    • 94.1.10 Task 10
    • 94.1.11 Task 11
    • 94.1.12 Task 12
    • 94.1.13 Task 13
    • 94.1.14 Task 14
    • 94.1.15 Task 15
    • 94.1.16 Task 16
    • 94.1.17 Task 17
    • 94.1.18 Task 18
    • 94.1.19 Task 19
  • 94.2 Time Series
    • 94.2.1 Task 20
    • 94.2.2 Task 21
    • 94.2.3 Task 22
    • 94.2.4 Task 23
    • 94.2.5 Task 24
    • 94.2.6 Task 25

94  Problems

94.1 Descriptive Analysis without time dimension

94.1.1 Task 1

  • Problem
  • Computation
  • Conclusion

Compute a Frequency Plot as described in Section 55.8.

[Interactive Shiny app]

Most students use the Windows operating system (Win NT6.0, Win NT5.1, and Win NT6.1). Few students use the MacOSX operating system and only one student uses GNU/Linux.

94.1.2 Task 2

  • Problem
  • Computation
  • Conclusion

Compute the Stem-and-Leaf Plot as described in Section 61.8.

[Interactive Shiny app]

The Stem-and-Leaf Plot shows a distribution which is extremely skewed to the right. Hence, it is difficult to determine the mean of the data series. Hint: move the trimming slider to the right to obtain a better image (you can also determine the “trimmed” mean).

94.1.3 Task 3

  • Problem
  • Computation 1
  • Computation 2
  • Conclusion 1
  • Conclusion 2

Recompute the Histogram and answer both questions of Section 62.14.

[Interactive Shiny apps]

The Histogram with 50 bins shows much more detail. The associated Frequency Table can be used to determine several types of Central Tendency. The Mode is located in the bin [200, 300[ because it has the highest absolute frequency. The Median is between 200 and 300 seconds because the cumulative relative frequency for bin [200, 300[ is 67.6% (the bin to the left only has a cumulative percentage of 28.8%).

The second Histogram in Section 62.13 is much easier to interpret because it makes sure that the Likert scores are defined as the center of each bin. The Histogram with “Unknown” scale and only 6 bins is misleading because all observations lie on the boundaries of the bins (e.g. the first bin contains all observations with values 1 and 2).

94.1.4 Task 4

  • Problem
  • Computation
  • Conclusion

Compute the 95% interval with the Harrell-Davis method as described in Section 64.15.

[Interactive Shiny app]

The 95% interval is [12.38, 26.68]. The step size was set to 0.005 because this allows us to use the exact values for [\(Quantile(0.025)\), \(Quantile(0.975)\)], i.e. without the need to compute an interpolation between two adjacent values.

94.1.5 Task 5

  • Problem
  • Computation
  • Conclusion

Determine the “best” measure of Central Tendency as described in Section 65.21.

[Interactive Shiny app]

The Arithmetic Mean is probably not a good choice when we wish to make a prediction for the time needed to submit the survey. The reason is that there are several outliers in the data set which heavily influence the Arithmetic Mean. The Figures of the Trimmed Mean and Winsorized Mean are both decreasing, implying that the Arithmetic Mean would be much lower if extreme values are systematically eliminated. Both Figures converge towards the Median which is a robust measure of Central Tendency. From the Histogram and Stem-and-Leaf Plot we also know that the data set does not have a Uniform Distribution, hence the Midrange is also not an appropriate choice. Furthermore, the Geometric Mean and Harmonic Mean are both not appropriate because we are not dealing with growth rates or output/input ratios.

Taking all these reasons into account, and making the assumption that we wish to make a robust prediction, the best estimate is provided by the Median (= 241.171).
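
The effect described above can be reproduced outside the RFC modules (which run in R). The following Python sketch uses made-up submission times; the data and the 10% trimming fraction are illustrative assumptions.

```python
import statistics

# Hypothetical submission times in seconds; the two largest values mimic
# students who paused the survey for a long time.
times = [180, 195, 210, 225, 230, 241, 250, 265, 280, 310, 900, 2400]

def trimmed_mean(x, p):
    """Drop a fraction p of the observations in each tail, then average."""
    x = sorted(x)
    k = int(len(x) * p)
    return statistics.mean(x[k:len(x) - k])

def winsorized_mean(x, p):
    """Replace a fraction p in each tail by the nearest retained value."""
    x = sorted(x)
    k = int(len(x) * p)
    x = [x[k]] * k + x[k:len(x) - k] + [x[-k - 1]] * k
    return statistics.mean(x)

print(statistics.mean(times))       # pulled upwards by the outliers
print(trimmed_mean(times, 0.10))    # moves towards the bulk of the data
print(winsorized_mean(times, 0.10))
print(statistics.median(times))     # robust reference point
```

As in the Shiny app, the trimmed and winsorized means fall between the outlier-driven Arithmetic Mean and the robust Median.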

94.1.6 Task 6

  • Problem
  • Computation
  • Conclusion

Use the Skewness and Kurtosis Tests in the Computation tab to analyse the Birth Weight dataset. Do you think that the data has a Normal Distribution?

[Interactive Shiny app]

The D’Agostino Skewness statistic is equal to -0.20698, which seems sufficiently close to zero to conclude that the distribution is fairly symmetric. In addition, the Kurtosis test statistic is 2.88678, which is very close to 3, the Kurtosis of a Normal Distribution. At first sight, both results are in line with those of the Normal Distribution.

Important note: we are not considering the fact that there is a probability of making the wrong conclusion. After all, the dataset only contains weights of a limited number of infants. This problem is the subject of Hypothesis Testing and will be discussed later.
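
The moment-based statistics behind these tests are easy to compute by hand. The sketch below (in Python rather than the book's R modules) uses raw sample skewness and kurtosis on synthetic, Normally distributed "birth weights"; the D’Agostino tests apply further transformations that are omitted here.

```python
import random
import statistics

def central_moment(x, k):
    m = statistics.mean(x)
    return sum((v - m) ** k for v in x) / len(x)

def skewness(x):
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    # "Raw" kurtosis: equal to 3 for a Normal Distribution.
    return central_moment(x, 4) / central_moment(x, 2) ** 2

# Synthetic stand-in for the Birth Weight data (kilograms).
random.seed(42)
weights = [random.gauss(3.3, 0.5) for _ in range(5000)]
print(round(skewness(weights), 3))   # near 0 for a symmetric sample
print(round(kurtosis(weights), 3))   # near 3 for a Normal sample
```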

94.1.7 Task 7

  • Problem
  • Conclusion

Use the Skewness-Kurtosis Plot from Section 67.18 and compare the markers with the theory described in Probability Distributions.

The following markers can be observed in the Figure:

  • Triangle: represents the Uniform Distribution with Skewness = 0 (Section 19.8) and Kurtosis = 9/5 (Section 19.9).
  • Plus: represents the Logistic Distribution with Skewness = 0 and Kurtosis = 4.2.
  • Star: represents the Normal Distribution with Skewness = 0 (Section 20.16) and Kurtosis = 3 (Section 20.17).
  • X in Box: represents the Exponential Distribution with Skewness = 2 and Kurtosis = 9.
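
The theoretical Kurtosis values above can be double-checked numerically. The Python sketch below integrates the central moments of the Uniform and Logistic densities with a simple midpoint rule (the grid size and integration range are arbitrary choices for the illustration).

```python
import math

def kurtosis_of_density(pdf, lo, hi, n=100000):
    """Midpoint-rule integration of the central moments of a density on [lo, hi]."""
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]
    ws = [pdf(x) * h for x in xs]
    mean = sum(x * w for x, w in zip(xs, ws))
    m2 = sum((x - mean) ** 2 * w for x, w in zip(xs, ws))
    m4 = sum((x - mean) ** 4 * w for x, w in zip(xs, ws))
    return m4 / m2 ** 2

uniform = kurtosis_of_density(lambda x: 1.0, 0.0, 1.0)
logistic = kurtosis_of_density(
    lambda x: math.exp(-x) / (1 + math.exp(-x)) ** 2, -40.0, 40.0)
print(round(uniform, 3))   # 9/5 = 1.8
print(round(logistic, 3))  # 4.2
```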

94.1.8 Task 8

  • Problem
  • Computation
  • Conclusion

Examine the effect of age on IM.Know as described in Section 69.7.

[Interactive Shiny app]

To do this you must set the “arrangement of groups” to “long format” and select both variables IM.Know and age. Note: the quantitative variable is always assumed to be in the first position.

94.1.9 Task 9

  • Problem
  • Computation
  • Conclusion

Examine the Scatterplot for the two discrete variables as described in Section 70.8 and explain the problems that you see.

[Interactive Shiny app]

Scatterplots are not well-suited to examine discrete variables because multiple points are on exactly the same position. This implies that it is impossible to determine which areas of the Scatter Plot have the highest density of points. The Scatterplot produced by RFC, however, also displays Histograms for both variables which provide (at least) some indication of where most points are located.

As an alternative one could add a very small random number to each observation (this is called “jitter”). A “jittered” Scatter Plot would show where most points are located because the coordinates are slightly randomized. In RFC we don’t use the jittering because there is a better solution which is discussed at a later stage (i.e. the Bivariate Kernel Density Plot).
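
Jittering itself is a one-line idea; here is a minimal Python sketch with made-up Likert-type answers (the jitter amount of 0.15 is an arbitrary assumption).

```python
import random

def jitter(values, amount=0.15):
    """Add small uniform noise so that tied discrete points separate visually."""
    return [v + random.uniform(-amount, amount) for v in values]

# Hypothetical Likert-type answers with many tied (x, y) pairs.
x = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
y = [2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
random.seed(1)
xj, yj = jitter(x), jitter(y)
print(len(set(zip(x, y))), "distinct positions before jittering")
print(len(set(zip(xj, yj))), "distinct positions after jittering")
```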

94.1.10 Task 10

  • Problem
  • Conclusion

Find an example of nonsense correlation and explain why it is spurious (this website might be helpful: https://tylervigen.com/spurious-correlations).

A common source of a spurious correlation between \(X\) and \(Y\) is found when a third (unobserved) variable \(Z\) has an impact on \(X\) and \(Y\). It is possible to remove the effect of \(Z\) by computing a Partial Correlation – obviously, this method can only be used when \(Z\) is actually observed.

94.1.11 Task 11

  • Problem
  • Conclusion

Discuss the question in Section 73.6.

The variable Learning is most closely associated with software competence (the Pearson Correlation is 0.62 which indicates a strong linear relationship). If we wish to control for the (confounding) effects of Happiness, Sport1, and Depression, we need to examine the Partial Pearson Correlations as well. In this case, the conclusion remains exactly the same because the Partial Correlation (between Software and Learning) is 0.58 which is very close to the (ordinary) Pearson Correlation and which indicates that the control variables (i.e. Happiness, Sport1, and Depression) do not influence the measured relationship between Learning and Software.

Does this imply that learning confidence and software competence are truly related to each other? No, it does not, because it is still possible that there are other (yet unobserved) variables which might have an obfuscating effect. On the other hand, the results from the Partial Pearson Correlation matrix increase our trust in the proposition that learning confidence is truly related to software competence.
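
For a single control variable the Partial Correlation has a closed form, \(r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}\). The Python sketch below applies this first-order formula to synthetic data in which a confounder \(z\) drives both variables (controlling for several variables at once, as RFC does, requires the full correlation matrix instead).

```python
import math
import random
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def partial_corr(x, y, z):
    """First-order Partial Correlation of x and y, controlling for one z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical confounder z that drives both x and y.
random.seed(7)
z = [random.gauss(0, 1) for _ in range(2000)]
x = [v + random.gauss(0, 0.5) for v in z]
y = [v + random.gauss(0, 0.5) for v in z]
print(round(pearson(x, y), 2))          # strong: both inherit z
print(round(partial_corr(x, y, z), 2))  # near zero once z is controlled for
```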

94.1.12 Task 12

  • Problem
  • Computation
  • Conclusion

Use the Computation tab to investigate the time needed by students to submit a short survey (in seconds) based on the QQ Plot for the Normal Distribution. We know that this series contains extreme values (some students paused the survey for a long time) – therefore, we want to investigate the effect of trimming on the distribution. Does trimming cause the data to behave like a Normal Distribution?

[Interactive Shiny app]

The analysis shows that the data is not Normally Distributed (even when a maximum of 10% trimming is applied to both sides of the distribution). On the other hand, there is a noticeable improvement when the trimming slider is moved to the right.
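
The straightness of a Normal QQ Plot can be summarized by the correlation between the sorted data and the theoretical Normal quantiles. The Python sketch below applies this to a synthetic right-skewed series (a lognormal stand-in for the submission times) before and after 10% trimming.

```python
import math
import random
import statistics
from statistics import NormalDist

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def qq_correlation(data):
    """Correlation between the sorted data and theoretical Normal quantiles."""
    s = sorted(data)
    n = len(s)
    q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(s, q)

def trimmed(data, p):
    s = sorted(data)
    k = int(len(s) * p)
    return s[k:len(s) - k]

# Right-skewed synthetic stand-in for the submission times.
random.seed(3)
skewed = [math.exp(random.gauss(0, 1)) for _ in range(1000)]
print(round(qq_correlation(skewed), 3))                 # poor fit
print(round(qq_correlation(trimmed(skewed, 0.10)), 3))  # better, still imperfect
```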

94.1.13 Task 13

  • Problem
  • Computation
  • Conclusion

Use the Tukey-Lambda PPCC Plot to examine the Divorces time series as described in Section 78.8.

[Interactive Shiny app]

The table of the Tukey-Lambda PPCC Plot shows that the highest Pearson Correlation is reached for \(\lambda = 0.14\) which corresponds to the Normal Distribution. This means that the Normal Distribution is a better fit for the Divorces time series than the other distributions that are listed (i.e. the Cauchy, Logistic, U-shaped, and Uniform Distributions).

Note: this procedure only works for symmetric distributions, so we are assuming that the Divorces have zero Skewness!
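
The idea behind the PPCC Plot can be sketched directly: for each \(\lambda\) on a grid, correlate the sorted data with Tukey-Lambda quantiles and keep the \(\lambda\) with the highest correlation. The Python sketch below uses a synthetic Normal sample; the grid bounds are arbitrary choices.

```python
import math
import random
import statistics

def tl_quantile(p, lam):
    """Tukey-Lambda quantile function; lambda = 0 reduces to the Logistic."""
    if abs(lam) < 1e-12:
        return math.log(p / (1 - p))
    return (p ** lam - (1 - p) ** lam) / lam

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ppcc(data, lam):
    s = sorted(data)
    n = len(s)
    q = [tl_quantile((i + 0.5) / n, lam) for i in range(n)]
    return pearson(s, q)

random.seed(5)
sample = [random.gauss(0, 1) for _ in range(800)]
grid = [l / 100 for l in range(-25, 101, 5)]
best = max(grid, key=lambda l: ppcc(sample, l))
print(best)  # for Normal data the maximising lambda is typically near 0.14
```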

94.1.14 Task 14

  • Problem
  • Computation 1
  • Computation 2
  • Conclusion

Compare the Marriages and Divorces time series based on the Kernel Density Plots (Section 80.13).

[Interactive Shiny apps]

Marriages

The Kernel Density Plot shows a bimodal distribution, which means that there are (many) months with a relatively low level and (many) other months with a relatively high level of marriages. The reason why this is the case will be discussed in Chapter 88. For now it is sufficient to think of the Marriages time series in terms of “popular months” and “unpopular months”, causing the distribution to have a bimodal shape. Perhaps there are months which are popular because of the expected weather?

Divorces

The Kernel Density Plot shows a unimodal distribution which means that in most months the number of divorces is roughly equal. Should there be any reason why couples would want to divorce in a specific month of the year?

Final Conclusion

Couples might have reasons to choose specific months of the year to get married which leads to a bimodal distribution. For couples who want to divorce there might be no or little incentive to choose a specific month of the year.
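
The bimodal-versus-unimodal contrast can be imitated with a small Gaussian KDE. The Python sketch below uses synthetic "popular/unpopular month" data; the levels and the bandwidth of 25 are illustrative assumptions, not the actual Marriages and Divorces figures.

```python
import math
import random

def gaussian_kde(data, h):
    """Return the Gaussian-kernel density estimate with bandwidth h."""
    n = len(data)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in data)

def count_modes(f, lo, hi, steps=200):
    ys = [f(lo + i * (hi - lo) / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])

# Synthetic stand-ins: two "popular/unpopular" levels vs one common level.
random.seed(11)
marriages = ([random.gauss(300, 20) for _ in range(60)] +
             [random.gauss(500, 20) for _ in range(60)])
divorces = [random.gauss(250, 25) for _ in range(120)]
print(count_modes(gaussian_kde(marriages, 25), 200, 600))  # bimodal
print(count_modes(gaussian_kde(divorces, 25), 150, 350))   # unimodal
```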

94.1.15 Task 15

  • Problem
  • Computation
  • Conclusion

Examine the data shown in the Computation tab and describe what you see in the Bivariate Kernel Density Plot. Also have a look at the 3D plot (select “persp” in the “Type of plot” dropdown menu).

[Interactive Shiny app]

The plot shows that there are only six areas (clusters) where points are present. Only three of them have a higher density, containing many points. The area with the highest density is located at the top left section of the graph. The Bivariate Kernel Density conveys more information than an ordinary Scatter Plot because there is a third dimension (i.e. the density of points) which is represented by drawing contours of equal density and using a color scheme which indicates the “height” of the density.

94.1.16 Task 16

  • Problem
  • Computation
  • Conclusion

Interpret the Bivariate Kernel Density Plot shown in the Computation tab.

[Interactive Shiny app]

The plot suggests a non-linear relationship between the variables “dis” and “nox” and indicates that the observation pairs are clustered in several areas. Each cluster exhibits a negative, linear relationship between the variables. The clusters themselves, however, are arranged in a non-linear manner. The plot does not explain why these patterns emerge, it only helps us detect them.

94.1.17 Task 17

  • Problem
  • Histogram
  • Computation 1
  • Computation 2
  • Computation 3
  • Conclusion

Examine the students’ Numeracy Scores (collected in the past) and use the Bootstrap Plot to predict the scores of the current academic year. The data can be found in the Histogram tab (you can copy the data and use them in any other R module).

[Interactive Shiny apps]

Before we compute the Bootstrap Plot, it is useful to investigate the distribution of numeracy scores. The results shown in the Computation 1 tab clearly show that:

  • The data are not from a Uniform Distribution which allows us to conclude that the Midrange is not an adequate measure of Central Tendency.
  • The data are not from a Normal Distribution because of deviations in the left tail. This implies that there are students with extremely low numeracy scores which might have a biasing effect on the Arithmetic Mean.

In order to see more details about the distribution it is possible to use the Kernel Density Plot as shown in the Computation 2 tab. The Gaussian Kernel shows that the distribution of numeracy scores is skewed to the left. In addition, there are two modes in the neighborhood of a numeracy score of 20, which might be explained by the fact that the student group is heterogeneous1.

The Bootstrap Plot (as shown in Computation 3) computes five measures of Central Tendency: Arithmetic Mean, Median, Midrange, Harmonic Mean, and Geometric Mean. In this case, three of these measures can be discarded: the Midrange (because the distribution is not uniform) and the Harmonic and Geometric means (because they have a huge variability).

The remaining two measures of Central Tendency could be used to make predictions but they each have different properties and can be used for specific purposes:

  • The Arithmetic Mean attributes an equal weight to the numeracy level of each student, including those with extremely low scores. The Kernel Density Plot (of simulated Arithmetic Means) looks like a Normal Distribution, and the associated Notched Box Plot and bootstrap table show that the Variability of the Arithmetic Mean, as measured by the Standard Deviation and the Interquartile Range, is very small (producing a small 50% interval of [19.783, 20.205] around the estimate of 19.996). The 95% interval is [19.331, 20.54], which is symmetric around the estimated value of 19.996.

  • The Median discards all extremes and predicts the numeracy level based on the student for whom 50% of peers perform better and the other 50% perform worse. The Median has an odd-looking, multimodal distribution, as can be seen in the Kernel Density Plot, and predicts the numeracy to be 20 with a 50% interval of [20, 21]. The 95% interval is exactly the same, i.e. [20, 21], which lies asymmetrically around the estimate of 20.

So if we wish to make a probabilistic prediction for a randomly selected student, we need to make additional assumptions which lead to different answers:

  • The 95% confidence interval for the population mean numeracy score (including all students that have enrolled in the statistics course) is [19.331, 20.54].

  • The 95% confidence interval for the trimmed-population mean numeracy score (excluding students with very low or very high scores) is [20, 21].

Conclusion: the right answer does not only depend on which measure of Central Tendency has the smallest Variability, it mainly depends on which question we wish to answer (i.e. which additional assumptions we want to make regarding the type of student that should be considered). Furthermore, the answer would be completely different if we were to split the dataset into homogeneous subgroups (e.g. females and males).

A final note: the trimming percentages can have a big impact on the results. Move the trimming slider to observe how sensitive the results are.
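
The mechanics of the Bootstrap Plot reduce to resampling with replacement and reading off percentiles. Here is a Python sketch with synthetic numeracy-like scores; the scores, seeds, and replication count are illustrative assumptions.

```python
import random
import statistics

def bootstrap_interval(data, stat, level=0.95, reps=2000, seed=0):
    """Percentile Bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    sims = sorted(stat([rng.choice(data) for _ in data]) for _ in range(reps))
    lo = sims[int(reps * (1 - level) / 2)]
    hi = sims[int(reps * (1 + level) / 2) - 1]
    return lo, hi

# Synthetic numeracy scores centred near 20 with clipped tails.
random.seed(4)
scores = [max(2, min(30, round(random.gauss(20, 4)))) for _ in range(150)]
print(bootstrap_interval(scores, statistics.mean))    # smooth interval
print(bootstrap_interval(scores, statistics.median))  # coarse, often asymmetric
```

Because the Median of integer-valued scores can only take a few distinct values, its bootstrap interval is typically coarse and asymmetric, just as observed in the Computation 3 tab.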

94.1.18 Task 18

  • Problem
  • Computation
  • Conclusion

Compute the SSROC analysis (Chapter 84) for the following items of the AMS dataset: Q1_5, Q1_12, Q1_19, and Q1_26. These items are used to measure students’ Amotivation (i.e. a lack of motivation to engage in higher education). Do you think we can add these items together to obtain a measure for Amotivation? Are there any items that should be left out?

[Interactive Shiny app]

When we compute the alternative SSROC scores, we obtain measures for each item that have a high rank correlation with the Arithmetic Mean (\(\tau \simeq 0.91\)). In other words, there is evidence to suggest that the Arithmetic Mean (or the simple sum) of all items would preserve the rank order of the obtained measurements.

The analysis also shows that the Cronbach \(\alpha\) (for all items) is 0.8878. This value cannot be increased by eliminating any of the four items.
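
The Cronbach \(\alpha\) statistic itself is a one-formula computation, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)\), where \(\sigma_i^2\) are the item variances and \(\sigma_T^2\) is the variance of the total score. The Python sketch below uses made-up Likert answers, not the actual AMS data; the item names are borrowed from the task for readability only.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item (same respondents)."""
    k = len(items)
    totals = [sum(answers) for answers in zip(*items)]
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Made-up Likert answers (1-7) of eight respondents to four items.
q1_5  = [1, 2, 2, 3, 5, 6, 6, 7]
q1_12 = [2, 2, 3, 3, 5, 5, 6, 7]
q1_19 = [1, 1, 2, 4, 4, 6, 7, 7]
q1_26 = [2, 3, 2, 3, 6, 5, 7, 6]
print(round(cronbach_alpha([q1_5, q1_12, q1_19, q1_26]), 3))
```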

94.1.19 Task 19

  • Problem
  • Computation
  • Conclusion

Have another look at the AMS dataset and show that the item Q1_2 was used to compute IM.Know (i.e. it is one of the items that was added to construct IM.Know) and item Q1_1 was not. Use Notched Boxplots and Kendall’s \(\tau\) rank correlations.

[Interactive Shiny app]

The results in the Computation tab show the boxplots of IM.Know (as the quantitative variable) versus Q1_1 as the categorical variable (note that the boxplots were constructed with the “long format” setting). All the boxplots seem to be at the same level, hence there’s no reason to believe that Q1_1 contributes to the measurement of IM.Know.

Now change the categorical variable Q1_1 into Q1_2 and observe how the boxplots now show an increasing pattern. In other words, higher Q1_2 answers correspond to higher IM.Know answers. The same phenomenon can be observed when using the Correlations tab of the R module.

94.2 Time Series

94.2.1 Task 20

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 88.6.

[Interactive Shiny app]

The Notched Box Plots (for periodic subseries) provide much more useful information about the seasonal pattern because the Median number of Marriages during the period May-September is clearly substantially higher than in other months (the Notches do not overlap, indicating that the pattern is not due to chance). The same conclusion can be drawn from the differenced periodic subseries because there are many Boxplot pairs with non-overlapping Notches (i.e. different Medians).

The Notched Box Plots for sequential blocks (i.e. years) do not show any evidence of a long-run trend. This is also the reason why the previous two plots clearly indicate a seasonal pattern: only when there is a strong, long-run trend in the time series can the seasonal pattern be obfuscated when no differencing is applied (see the example of the Airline data in Section 88.5).

Conclusion: the Marriages time series exhibits no long-run trend but only a seasonal pattern.

94.2.2 Task 21

  • Problem
  • Histogram
  • Computation 1
  • Computation 2
  • Conclusion

Generate a prediction based on the Blocked Bootstrap Plot for the time series “Rainfall in Nottingham Castle” as is shown in the Histogram tab (copy the data to any other R module).

[Interactive Shiny apps]

The Histogram shows two bins with high absolute frequency, i.e. 14 cases and 13 cases. The shape of the Histogram depends on the number of bins that is chosen – hence, the bimodality of the Rainfall series is only detected if an appropriate choice is made. The Histogram in the bookmarked computation might have just enough bins to detect the bimodal nature of the time series (try to recompute the Histogram with more bins).

The Gaussian Kernel Density Plot in Computation 1 clearly shows a bimodal distribution of Rainfall. The shape of this plot also depends on a parameter (i.e. the so-called “bandwidth” parameter). However, the software is often able to choose an appropriate default value which allows one to detect the interesting features of the underlying distribution.

Clearly the Harmonic Mean provides the estimate with the smallest Standard Deviation and Interquartile Range. The Arithmetic Mean and Geometric Mean are very close (in terms of variability) but if we prefer the predictor with the highest confidence, the Harmonic Mean seems to win.

The Blocked Bootstrap Plot provides adequate and useful information about the empirical distribution but it requires the user to have some practical experience (because sometimes the simulations can fail). When the confidence intervals are symmetrically distributed around the estimate, the (Blocked) Bootstrap Plot provides very useful information. When the intervals are not symmetric, one should be wary of the possibility that the interpretation could be problematic, especially if the data under investigation is heterogeneous. When the estimate falls outside the 50% interval one should not use the bootstrap results.

Finally, note that the method employed here involves simulations to derive intervals. This is a probabilistic process which implies that every computation will yield different results. The discrepancies between computations, however, will become reasonably small when a sufficient number of simulations is used.
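
The blocked variant differs from the ordinary bootstrap only in that contiguous blocks, rather than single observations, are resampled. Here is a Python sketch on a synthetic autocorrelated series; the AR(1) parameters, block length, and seeds are illustrative assumptions.

```python
import random
import statistics

def blocked_bootstrap_means(series, block_len, reps=1000, seed=0):
    """Resample contiguous blocks so short-run dependence survives resampling."""
    rng = random.Random(seed)
    n = len(series)
    starts = range(n - block_len + 1)
    means = []
    for _ in range(reps):
        resample = []
        while len(resample) < n:
            s = rng.choice(starts)
            resample.extend(series[s:s + block_len])
        means.append(statistics.mean(resample[:n]))
    return sorted(means)

# Synthetic autocorrelated series (AR(1) around a level of 60).
random.seed(9)
series, prev = [], 60.0
for _ in range(240):
    prev = 0.7 * prev + 0.3 * 60 + random.gauss(0, 8)
    series.append(prev)
sims = blocked_bootstrap_means(series, block_len=12)
print(round(sims[25], 2), round(sims[974], 2))  # approximate 95% interval
```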

94.2.3 Task 22

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 90.6.

[Interactive Shiny app]

The Standard Deviation-Mean Plot provides evidence that the Standard Deviations of subsequent years can be explained by the corresponding Mean of the same year. This is a typical pattern which is often encountered in biology and economics. It implies that the (annual) Variability of the time series is not stable over time. In practice, we will have to apply some sort of transformation in order to stabilize the Variance (or Standard Deviation).
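
The data behind such a plot are simply the per-year means and standard deviations. The Python sketch below builds a synthetic multiplicative series (growth rate and seasonal factors are made up) in which the sd/mean ratio is constant, the classic signal for a logarithmic transformation.

```python
import statistics

def sd_mean_pairs(series, period=12):
    """Pair each year's Mean with its Standard Deviation (monthly data)."""
    pairs = []
    for i in range(0, len(series) - period + 1, period):
        block = series[i:i + period]
        pairs.append((statistics.mean(block), statistics.stdev(block)))
    return pairs

# Synthetic multiplicative series: variability grows with the level.
series = []
for year in range(10):
    level = 100 * 1.2 ** year
    series += [level * (1 + 0.1 * (m % 4 - 1.5)) for m in range(12)]
for mean, sd in sd_mean_pairs(series):
    print(round(mean, 1), round(sd, 1), round(sd / mean, 3))  # constant ratio
```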

94.2.4 Task 23

  • Problem
  • Computation
  • Conclusion

Analyze the monthly Marriages time series as described in Section 91.8.

[Interactive Shiny app]

The Variance Reduction Matrix demonstrates that only seasonal differencing (i.e. \(d = 0\) and \(D = 1\)) is required to make the Variance as small as possible. This implies that the time series contains a strong seasonal pattern which can be removed through seasonal differencing. Exactly the same conclusion is obtained when we use the Range or the Trimmed Variance.
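
A Variance Reduction Matrix is just the variance of the series after each \((d, D)\) differencing combination. The Python sketch below builds a synthetic seasonal series without a long-run trend (the sine amplitude and pseudo-noise are arbitrary) and confirms that seasonal differencing alone minimises the variance.

```python
import math
import statistics

def diff(x, lag=1):
    return [b - a for a, b in zip(x, x[lag:])]

def variance_reduction_matrix(series, period=12):
    """Variance of the series after each combination of d and D differencing."""
    table = {}
    for d in (0, 1, 2):
        for D in (0, 1):
            x = series
            for _ in range(D):
                x = diff(x, period)
            for _ in range(d):
                x = diff(x, 1)
            table[(d, D)] = statistics.variance(x)
    return table

# Synthetic seasonal series without long-run trend.
series = [100 + 40 * math.sin(2 * math.pi * t / 12) + (t * 7 % 5 - 2)
          for t in range(120)]
table = variance_reduction_matrix(series)
print(min(table, key=table.get))  # (0, 1): seasonal differencing is enough
```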

94.2.5 Task 24

  • Problem
  • Computation
  • Conclusion

Based on the HPC time series, use autocorrelations to identify the long-run trend and seasonality.

[Interactive Shiny app]

First we compute the ACF for \(d = D = 0\) and observe a slowly decreasing pattern of autocorrelation coefficients. We decide to apply non-seasonal differencing – when the ACF is re-computed with \(d = 1\) we can see a seasonal trend pattern emerge. Therefore, an additional seasonal differencing operation must be applied. The result shown in the Computation tab shows that \(d = D = 1\) allows us to remove the seasonal and non-seasonal trend.
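
The same diagnostic can be sketched on synthetic data: a slowly decaying ACF before differencing, and only short MA-type spikes left after \(d = D = 1\). The trend slope, seasonal amplitude, and noise level below are illustrative assumptions, not the HPC data.

```python
import math
import random
import statistics

def acf(series, max_lag=24):
    n = len(series)
    m = statistics.mean(series)
    c0 = sum((v - m) ** 2 for v in series)
    return [sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def diff(x, lag=1):
    return [b - a for a, b in zip(x, x[lag:])]

# Synthetic trend + seasonal series.
random.seed(8)
series = [0.5 * t + 10 * math.sin(2 * math.pi * t / 12) + random.gauss(0, 2)
          for t in range(144)]
r0 = acf(series)
r1 = acf(diff(diff(series, 12), 1))  # d = D = 1
print(round(r0[1], 2))               # near 1: slowly decaying ACF (trend)
print(round(statistics.mean(abs(v) for v in r1[1:]), 2))  # much smaller overall
```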

94.2.6 Task 25

  • Problem
  • Computation
  • Conclusion

Repeat the previous task but use the Cumulative Periodogram instead.

[Interactive Shiny app]

The Computation tab shows the result for \(d = D = 1\). It is clear that the “big steps” are not present in the cumulative periodogram at this level of differencing. When the differencing sliders are set back to zero, we can see the non-seasonal and seasonal patterns reappear.
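
A "big step" in the cumulative periodogram is simply a single frequency carrying a large share of the total power. The Python sketch below computes a cumulative periodogram via a direct discrete Fourier transform on a synthetic series dominated by a period-12 cycle (amplitudes and noise are made up).

```python
import cmath
import math

def cumulative_periodogram(series):
    """Normalised cumulative sum of the periodogram ordinates."""
    n = len(series)
    m = sum(series) / n
    x = [v - m for v in series]
    power = []
    for k in range(1, n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    out, cum = [], 0.0
    for p in power:
        cum += p
        out.append(cum / total)
    return out

# Synthetic series dominated by a seasonal cycle of period 12.
series = [10 * math.sin(2 * math.pi * t / 12) + 0.5 * (3 * t % 7 - 3)
          for t in range(120)]
cp = cumulative_periodogram(series)
print(round(cp[9] - cp[8], 2))  # the "big step" at frequency 10/120 (period 12)
```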


  1. In fact we know that this is the case. For instance, there is a substantial difference between the numeracy scores of males and females.


© 2026 Patrick Wessa. Provided as-is, without warranty.
