97 The Sample
97.1 Introduction
In Chapter 96 we assumed that the lengths of all students in the population were available (the total number of observations was \(M=1000\)). Obtaining this information requires measuring every individual in the population, which is often an impractical, if not impossible, undertaking. Therefore we need a little help from statistics to solve this problem.
The purpose of inferential statistics is to obtain knowledge about the population without the need to measure each individual or item of the population. This is mainly based on a so-called sampling process. The underlying idea of sampling is to obtain measurements from a subset of individuals or items that are taken/drawn from the population. The measurements from the sample are used to infer information about the population.
There are many ways to take/draw samples from a population. In this chapter, however, we will only discuss a special type of sampling process, the so-called simple random sampling. By definition, a simple random sample consists of individuals or items (taken from the population) which are chosen randomly and entirely by chance. In other words, each individual or item in the population has the same probability of being chosen. In this chapter we assume sampling with replacement, so the draws can be treated as independent and identically distributed; for sampling without replacement from a finite population, a finite population correction would be needed for standard errors.
If we take a simple random sample from the student population of Chapter 96 then we obtain a set of measurements \(X_1', X_2', X_3', …, X_N'\) where \(N = 100\) is the so-called sample size. This sample can be used to compute a histogram and the associated frequency table as is shown below:
The results are similar but not exactly the same as in Chapter 96. The reason for the discrepancies is rather obvious: we only took a random sample of 100 measurements from the population. If we repeat the experiment (by taking a new sample) then we would, again, obtain a histogram that looks similar (but is not identical) to the original histogram (only compare the “relative” frequencies and the “shape” of the histogram -- do not compare absolute frequencies!). For each (simple random) sample there are discrepancies between the sample-based histogram and the (true) histogram of the population. Is this something we should worry about?
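This sample-to-sample variability is easy to see in a quick simulation. The sketch below uses plain Python with a synthetic population standing in for the Chapter 96 data (the generated values are illustrative, not the actual measurements); it draws two simple random samples with replacement and compares their means:

```python
import random
import statistics

# A hypothetical population of M = 1000 student lengths (cm),
# generated here as a stand-in for the data of Chapter 96.
random.seed(0)
population = [random.gauss(170.4, 9.7) for _ in range(1000)]

# Draw two independent simple random samples of size N = 100
# (with replacement, as assumed in this chapter).
N = 100
sample_a = random.choices(population, k=N)
sample_b = random.choices(population, k=N)

# The sample means are close to each other (and to the population
# mean) but not identical -- exactly the discrepancy described above.
print(statistics.fmean(population))
print(statistics.fmean(sample_a))
print(statistics.fmean(sample_b))
```

Repeating the two `random.choices` calls with a different seed yields yet another pair of similar-but-different samples, mirroring the repeated-histogram experiment described above.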
97.2 Sample Statistics \(m\) and \(s\)
There are two fundamental properties which can be used to describe the lengths of students in our sample (we use the ML Fitting module to estimate the mean and standard deviation of the sample data):
In other words:
- the Arithmetic Mean \(m = \frac{1}{N} \sum_{i=1}^{N} X_i' = 170.444\)
- the sample standard deviation \(s = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} \left( X_i' - m \right)^2 } \simeq 9.74\)
Again, the computed mean and standard deviation are close to the true population values (but not exactly equal). If we took another (simple random) sample, the mean and standard deviation would again be different. For each sample we take, the results would be close to, but not identical with, the population parameters.
Observe how the formula for the (sample-based) standard deviation contains \(N-1\) in the denominator -- for populations we would have used \(N\) instead. The mathematical reason why this is the case goes beyond the scope of this chapter. On the other hand, it is obvious that the sample-based standard deviation should reflect the fact that its computation is based on an “estimate” of the sample mean (which has a certain amount of uncertainty). Dividing by \(N-1\) (instead of \(N\)) increases the standard deviation which corresponds with the higher level of uncertainty.
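Python's `statistics` module exposes both denominators directly, which makes the \(N\) versus \(N-1\) distinction concrete. The sketch below uses made-up length values, not the actual data behind \(m = 170.444\) and \(s = 9.74\):

```python
import statistics

# Hypothetical sample of student lengths (cm); illustrative values only.
sample = [158.2, 163.5, 167.1, 169.8, 170.4, 171.9, 174.0, 176.3, 181.2, 188.6]

m = statistics.fmean(sample)       # arithmetic mean
s = statistics.stdev(sample)       # sample sd: divides by N - 1
s_pop = statistics.pstdev(sample)  # population sd: divides by N

# Dividing by N - 1 instead of N always yields the larger value,
# reflecting the extra uncertainty from estimating the mean.
print(m, s, s_pop)
```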
The interpretation of \(m\) is rather obvious. The statistic \(s\), however, has an interpretation that is much less intuitive. For this reason, we investigate the number of students (within our sample) which are contained in the following intervals.
We write \(]a, b]\) for the half-open interval that excludes \(a\) and includes \(b\) (equivalently, \((a, b]\)).
- \(]m - 1 s, m + 1 s]\) or \(]170.444 - 1*9.74, 170.444 + 1*9.74]\)
- \(]m - 2 s, m + 2 s]\) or \(]170.444 - 2*9.74, 170.444 + 2*9.74]\)
- \(]m - 3 s, m + 3 s]\) or \(]170.444 - 3*9.74, 170.444 + 3*9.74]\)
We can make an approximate estimate of the number of students in each interval by using the frequencies of the histogram:
- \(]170.444 - 9.74, 170.444 + 9.74] \simeq ]160.704, 180.184]\) which (approximately) contains the lengths of \(15+18+20+15=68\) students (= 68%)
- \(]170.444 - 2*9.74, 170.444 + 2*9.74] \simeq ]150.964, 189.924]\) which (approximately) contains the lengths of \(3+12+68+9+4=96\) students (= 96%)
- \(]170.444 - 3*9.74, 170.444 + 3*9.74] \simeq ]141.224, 199.664]\) which contains the lengths of \(1+96+1+2=100\) students (= 100%)
Is it fair to state that these results are very close to what one would predict based on the assumption that student lengths are normally distributed with E\((X) = 170.444\) and \(s = 9.74\)?
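The interval counts above can be reproduced mechanically. The sketch below uses a synthetic sample (standing in for the real 100 measurements) and counts the observations inside \(]m - ks, m + ks]\) for \(k = 1, 2, 3\):

```python
import random
import statistics

# A synthetic sample of 100 lengths standing in for the real measurements.
random.seed(1)
sample = [random.gauss(170.4, 9.7) for _ in range(100)]

m = statistics.fmean(sample)
s = statistics.stdev(sample)

# Count the observations in the half-open interval ]m - k*s, m + k*s].
counts = []
for k in (1, 2, 3):
    inside = sum(1 for x in sample if m - k * s < x <= m + k * s)
    counts.append(inside)
    print(f"]m - {k}s, m + {k}s] contains {inside} of {len(sample)} observations")
```

For normally distributed data the three counts should come out near 68, 95 and 100, matching the pattern observed above.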
97.3 Inference based on the Normal Model
In the previous section we described a simple random sample of students based on the following:
- a frequency table
- a histogram
- the Arithmetic Mean \(m = 170.444\)
- the Standard Deviation \(s = 9.74\)
It is, however, not necessary to add up the frequencies of the histogram to obtain the intervals from the previous section. The reason for this is the fact that the histogram is merely a graphical illustration of how student lengths are distributed among the individuals of the sample. It is fair to model the student lengths as a normal distribution because the shape of the histogram can be reasonably well approximated by a Gaussian (i.e. Normal) curve (with \(m = 170.444\) and \(s = 9.74\)).
The so-called fit of the Normal Model for the sample histogram can be assessed with a wide variety of statistical tools. Here are some computations and plots which provide evidence for the Normal Model:
- Histogram
- Quantiles & Normal QQ Plot
- Kernel Density Plots
- Tukey-lambda PPCC Plot
- Skewness & Kurtosis tests
- Skewness-Kurtosis Plot
In addition to these empirical observations, there are also theoretical reasons to believe that lengths of humans can be modeled by the Normal Distribution:
- there are an infinite number of factors that influence one’s length
- the students in our population are “normal” human beings, implying that there is no reason for their lengths to be biased in some way (e.g. they are not all engaged in either horse racing or basketball)
- the sample has been obtained through “simple random sampling” (with replacement, as assumed in this chapter) which ensures that every individual has an equal probability of being included in each draw (hence the sampling mechanism is not biased)
- under this with-replacement assumption, the individuals of the sample have been drawn from the population independently
Hence, it is fair to conclude that the variable \(X\) (i.e. the lengths in the sample) is approximately normally distributed with E\((X) = 170.444\) (the Arithmetic Mean of the sample) and \(s = 9.74\) (the sample standard deviation).
It does not make much sense to use our knowledge about the Normal Distribution to create a prediction model for the sample. After all, we are interested in the properties of the population (the sample is just a convenient tool to make inferences about the population).
Therefore we conduct a little “thought experiment” to derive the mathematical model which will help us to make useful predictions about the population (even if only sample data is available). The experiment goes like this:
- define a population which has a property of interest
- measure the Arithmetic Mean of the population and name it \(\mu\)
- measure the biased Variance of the population and name it \(\sigma^2\)
- draw a simple random sample of size \(N\) from the population
- compute the Arithmetic Mean based on the sample and name it \(m_1\)
- put all individuals/items from the sample back into the population
- draw a new simple random sample (independently from the previous one) of size \(N\)
- compute the Arithmetic Mean based on the sample and name it \(m_2\)
- repeat this process until you have a total of \(K\) sample means \(m_1\), \(m_2\), \(m_3\), …, \(m_K\)
Now define a new random variable \(\bar{X}\) which represents the sample mean (i.e. the Arithmetic Mean of a random sample); its distribution describes the outcomes of this experiment as \(K \rightarrow \infty\). Within the context of this thought experiment, it can be shown that there are three fundamental theorems about \(\bar{X}\).
Theorem 1. If the population is normally distributed with \(\mu \in \mathbb{R}\) and \(\sigma \in \mathbb{R}_0^+\) then \(\bar{X}\) is also normally distributed with E\((\bar{X}) = \mu\) and with variance \(\mu_2 = \frac{\sigma^2}{N}\).
Hence, the probability density function of \(\bar{X}\) is \(\frac{1}{\frac{\sigma}{\sqrt{N}}\sqrt{2 \pi} } e^{-\frac{1}{2} \left( \frac{\bar{X} -\mu}{ \frac{\sigma}{ \sqrt{N} } } \right)^2 }\).
Theorem 1 implies that it is, indeed, possible to use the Arithmetic Mean of a simple random sample to make inferences about the Arithmetic Mean of the population. We have demonstrated that the lengths from the population and the sample are both normally distributed, therefore we can use the Arithmetic Mean from the sample to make a statistical prediction about the Arithmetic Mean of the population (the relationship between \(\mu\) and \(\bar{X}\) is described in the theorem).
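Theorem 1 can be verified empirically by actually running the thought experiment. The following sketch uses a synthetic normal population (the parameters \(\mu = 171.6\), \(\sigma = 10\), \(N = 25\) are chosen for illustration) and checks that the mean and standard deviation of the \(K\) sample means behave as the theorem predicts:

```python
import random
import statistics

random.seed(2)

mu, sigma, N, K = 171.6, 10.0, 25, 20_000

# Repeat the thought experiment K times: draw a simple random sample of
# size N from a normal population and record its arithmetic mean.
means = []
for _ in range(K):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    means.append(statistics.fmean(sample))

# By Theorem 1, E(Xbar) = mu and sd(Xbar) = sigma / sqrt(N) = 2.
print(statistics.fmean(means))   # close to 171.6
print(statistics.pstdev(means))  # close to 2.0
```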
Theorem 2 (Central Limit Theorem). Even if the population property does not have a normal distribution, the variable \(\bar{X}\) is approximately normally distributed if \(N\) is sufficiently large and the population variance is finite. The more skewed the population distribution, the larger \(N\) must be for this approximation to work well.
Theorem 2 implies that the Normal Model can be used in a wide range of practical situations (even if no information is available about the distribution of the population).
Theorem 3. The population standard deviation \(\sigma\) (from the probability density function of \(\bar{X}\) in Theorem 1) can be estimated by \(s\) (the standard deviation from the sample) if \(N\) is sufficiently large.
Hence, if \(\sigma\) is unknown but \(N\) is sufficiently large, a large-sample approximation for the probability density function of \(\bar{X}\) is \(\frac{1}{\frac{s}{\sqrt{N}}\sqrt{2 \pi} } e^{-\frac{1}{2} \left( \frac{\bar{X} -\mu}{ \frac{s}{ \sqrt{N} } } \right)^2 }\) (equivalently, the plug-in \(z\)-pivot is approximate). For normal populations with unknown \(\sigma\), the exact pivot is \(t = \frac{\bar{X}-\mu}{s/\sqrt{N}}\) with \(N-1\) degrees of freedom (Student-\(t\)). If \(\sigma\) is known, use the exact standard normal pivot from Theorem 1.
Theorem 3 is often used in practice because the population variance is (almost always) unknown.
97.3.1 Example 1
Consider a normally distributed population of student lengths with \(\mu = 171.6\) and \(\sigma = 10\). A simple random sample is drawn from this population with sample size \(N = 25\). What is the probability that the sample mean is between 169.6 and 173.6? What is the probability that a student has a length between these bounds?
The probability density function of \(\bar{X}\) is \(\frac{1}{2 \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{\bar{X} - 171.6}{2} \right)^2 }\) which implies that
\[\text{P}(169.6 \leq \bar{X} \leq 173.6) = \int_{169.6}^{173.6} \frac{1}{2 \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{\bar{X} - 171.6}{2} \right)^2 } \text{d}\bar{X}\]
This type of integral can always be solved through substitution:
\[Z = \frac{\bar{X} - 171.6}{2} \Rightarrow \text{d} \bar{X} = 2 \text{d} Z\]
Hence the integral can be written as:
\[\text{P} \left( \frac{169.6 - 171.6}{2} \leq Z = \frac{\bar{X} - 171.6}{2} \leq \frac{173.6 - 171.6}{2} \right) = \int_{-1}^{1} \frac{1}{2 \sqrt{2 \pi}} e^{-\frac{1}{2} Z^2} 2 \text{d} Z\]
From the Gaussian Table (cfr. Appendix E) it follows that:
\[\int_{0}^{1} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}Z^2} \text{d} Z = 0.34134\]
This allows us to answer the first question: P\((169.6 \leq \bar{X} \leq 173.6) \simeq 2*0.34134 = 0.68268 = 68.268\%\).
To answer the second question we need to consider the fact that E\((l) = 171.6\) and \(\sigma = 10\). This allows us to write the probability density function of \(l\) as \(\frac{1}{10 \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{l - 171.6}{10} \right)^2 }\) which implies that
\[\text{P}(169.6 \leq l \leq 173.6) = \int_{169.6}^{173.6} \frac{1}{10 \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{l - 171.6}{10} \right)^2 } \text{d}l\]
This type of integral can always be solved through substitution:
\[Z = \frac{l - 171.6}{10} \Rightarrow \text{d} l = 10 \text{d} Z\]
Hence the integral can be written as:
\[\text{P} \left( \frac{169.6 - 171.6}{10} \leq Z = \frac{l - 171.6}{10} \leq \frac{173.6 - 171.6}{10} \right) = \int_{-0.2}^{0.2} \frac{1}{10 \sqrt{2 \pi}} e^{-\frac{1}{2} Z^2} 10 \text{d} Z\]
From the Gaussian Table (cfr. Appendix E) it follows that:
\[\int_{0}^{0.2} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}Z^2} \text{d} Z = 0.07930\]
This allows us to answer the second question: P\((169.6 \leq l \leq 173.6) \simeq 2*0.07930 = 0.15860 = 15.86\%\). Observe that this probability is much smaller than 68.268%.
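Instead of the Gaussian table, the two probabilities of Example 1 can be computed directly with `statistics.NormalDist` (available in Python 3.8+):

```python
from statistics import NormalDist

# Sampling distribution of the mean: sd = sigma / sqrt(N) = 10 / 5 = 2.
xbar = NormalDist(mu=171.6, sigma=2.0)
p_mean = xbar.cdf(173.6) - xbar.cdf(169.6)       # ~0.68269

# Distribution of a single student's length: sd = sigma = 10.
length = NormalDist(mu=171.6, sigma=10.0)
p_single = length.cdf(173.6) - length.cdf(169.6)  # ~0.15852

print(round(p_mean, 5), round(p_single, 5))
```

Note that `NormalDist` avoids the rounding in the printed table: the table entry 0.34134 is the four-decimal truncation of \(\Phi(1) - \Phi(0)\).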
97.3.2 Example 2
Consider a normally distributed population of student lengths with \(\mu = 171.6\) and unknown \(\sigma\). A simple random sample is drawn from this population with sample size \(N = 25\). What is the probability that the sample mean is between 169.6 and 173.6? What is the probability that a student has a length between these bounds?
Because \(N = 25\) is relatively small, a large-sample normal approximation is not ideal. A full numerical solution is deferred until after the Student-\(t\) distribution is introduced below. In practice, the first question requires an observed sample standard deviation \(s\) to form a \(t\)-statistic, and the second question cannot be determined from \(\mu\) alone because \(\sigma\) is unknown.
97.3.3 Example 3
Consider an asymmetrically distributed population property with \(\mu = 171.6\) and \(\sigma = 10\). A simple random sample is drawn from this population with sample size \(N = 100\). What is the probability that the sample mean is between 169.6 and 173.6? What is the probability that a student has a length between these bounds?
By Theorem 2, the variable \(\bar{X}\) is approximately normally distributed with E\((\bar{X}) = 171.6\), \(\sqrt{\mu_2} = 1\) and probability density function \(\frac{1}{\sqrt{2 \pi} } e^{-\frac{1}{2} \left( \frac{\bar{X} - 171.6}{1} \right)^2 }\). Hence, the first question can be easily answered:
\[\text{P} \left( 169.6 \leq \bar{X} \leq 173.6 \right) \simeq 2 * 0.47725 = 95.45\%\]
The actual lengths of the population have an asymmetric distribution with E\((l) = 171.6\) and \(\sigma = 10\). This information is not sufficient to answer the second question because the actual probability density function is unknown.
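Theorem 2 can be illustrated with a deliberately skewed population. In the sketch below a shifted exponential distribution (chosen so that \(\mu = 171.6\) and \(\sigma = 10\), as in Example 3; the choice of distribution is ours) plays the role of the asymmetric population:

```python
import random
import statistics

random.seed(3)

# An asymmetric (exponential-shaped) population, shifted so that
# mu = 171.6 and sigma = 10 -- a stand-in for Example 3's population.
def draw():
    return 161.6 + random.expovariate(1 / 10)  # mean 171.6, sd 10

N, K = 100, 10_000
means = [statistics.fmean(draw() for _ in range(N)) for _ in range(K)]

# Despite the skewed population, Xbar is approximately normal with
# mean 171.6 and sd sigma/sqrt(N) = 1; the interval [169.6, 173.6]
# should therefore capture roughly 95.45% of the sample means.
inside = sum(1 for m in means if 169.6 <= m <= 173.6)
print(inside / K)
```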
97.3.4 Example 4
Consider a normally distributed population property with \(\mu = 171.6\) and \(\sigma = 10\). A simple random sample is drawn from this population with sample size \(N = 25\). Determine a symmetric interval around \(\mu = 171.6\) for which there is a 95% probability that it contains the sample mean.
To solve this problem we need to find \(k\) in
\[\text{P} \left( \mu - k \leq \bar{X} \leq \mu + k \right) = 95\%\]
or
\[\text{P} \left( 171.6 - k \leq \bar{X} \leq 171.6 + k \right) = 95\%\]
Since we know that the variable \(\bar{X}\) is normally distributed with probability density function
\[\frac{1}{2 \sqrt{2 \pi} } e^{- \frac{1}{2} \left( \frac{\bar{X} - 171.6}{2} \right)^2 }\]
it follows that
\[\int_{171.6-k}^{171.6+k} \frac{1}{2 \sqrt{2 \pi}} e^{- \frac{1}{2} \left( \frac{\bar{X} - 171.6}{2} \right)^2 } \text{d} \bar{X} = 0.95\]
which can be solved through substitution
\[Z = \frac{\bar{X} - 171.6}{2} \Rightarrow \text{d} \bar{X} = 2 \text{d} Z\]
Therefore we can write the integral as
\[\int_{\frac{171.6-k-171.6}{2}}^{\frac{171.6+k-171.6}{2}} \frac{1}{2\sqrt{2 \pi}} e^{-\frac{1}{2} Z^2 } 2 \text{d} Z = 0.95\]
or
\[\int_{\frac{-k}{2}}^{\frac{k}{2}} \frac{1}{2\sqrt{2 \pi}} e^{-\frac{1}{2} Z^2 } 2 \text{d} Z = 0.95\]
We look up the value \(z\) which corresponds to \(\frac{0.95}{2} = 47.5\%\) in the Gaussian Table (cfr. Appendix E) and use it as an approximation for \(\frac{k}{2}\)
\[\frac{k}{2} \simeq 1.96 \Rightarrow k \simeq 3.92\]
Hence the answer is \(\left[ 171.6 - 3.92; 171.6 + 3.92 \right] = \left[ 167.68; 175.52 \right]\).
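The table lookup of Example 4 corresponds to the 97.5% quantile of the standard normal distribution, which `statistics.NormalDist.inv_cdf` provides directly:

```python
from statistics import NormalDist

mu, sigma, N = 171.6, 10.0, 25
se = sigma / N ** 0.5             # sigma / sqrt(N) = 2

# The critical value 1.96 from the Gaussian table is the 97.5% quantile
# of the standard normal distribution (2.5% in each tail).
z = NormalDist().inv_cdf(0.975)   # ~1.95996
k = z * se                        # ~3.92

print(mu - k, mu + k)             # roughly [167.68, 175.52]
```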
97.4 Significant difference of the Arithmetic Mean
A simple random sample of size \(N\) is drawn from a normal population with mean \(\mu\) and standard deviation \(\sigma\). In accordance with the previous exercise we determine the symmetric interval around \(\mu\) for which there is a 95% probability that it contains the sample mean.
To solve this problem we need to find \(k\) in
\[\text{P} \left( \mu - k \leq \bar{X} \leq \mu + k \right) = 95\%\]
Since we know that the variable \(\bar{X}\) is normally distributed with probability density function
\[\frac{1}{\frac{\sigma}{\sqrt{N}} \sqrt{2 \pi} } e^{- \frac{1}{2} \left( \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{N}}} \right)^2 }\]
it follows that
\[\text{P} \left( \mu - k \leq \bar{X} \leq \mu + k \right) = \int_{\mu-k}^{\mu+k} \frac{1}{\frac{\sigma}{\sqrt{N}} \sqrt{2 \pi} } e^{- \frac{1}{2} \left( \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{N}}} \right)^2 } \text{d} \bar{X} = 0.95\]
which can be solved through substitution, implying that the integral can be written as follows
\[\int_{\frac{\mu-k-\mu}{\frac{\sigma}{\sqrt{N}}}}^{\frac{\mu+k-\mu}{\frac{\sigma}{\sqrt{N}}}} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} Z^2 } \text{d} Z = 0.95\]
or
\[\int_{\frac{-k}{\frac{\sigma}{\sqrt{N}}}}^{\frac{k}{\frac{\sigma}{\sqrt{N}}}} \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} Z^2 } \text{d} Z = 0.95\]
We look up the value \(z\) which corresponds to \(\frac{0.95}{2} = 47.5\%\) in the Gaussian Table (cfr. Appendix E) and use it as an approximation for \(\frac{k}{\frac{\sigma}{\sqrt{N}}}\)
\[\frac{k}{\frac{\sigma}{\sqrt{N}}} \simeq 1.96 \Rightarrow k \simeq 1.96 \frac{\sigma}{\sqrt{N}}\]
Hence we conclude that \(\text{P} \left( \mu - 1.96 \frac{\sigma}{\sqrt{N}} \leq \bar{X} \leq \mu + 1.96 \frac{\sigma}{\sqrt{N}} \right) \simeq 95\%\) for the sample mean \(\bar{X}\) under this model.
In other words: \(\left[ \mu - 1.96 \frac{\sigma}{\sqrt{N}}; \mu + 1.96 \frac{\sigma}{\sqrt{N}} \right]\) is a 95% acceptance region (central sampling interval) for the sample mean \(\bar{X}\) under the stated model. This also implies that there is only a 5% probability that the sample mean is not contained in the interval.
This reasoning is the basis of a hypothesis test: if the observed sample mean \(m\) falls outside the interval, we reject (at the 5% level) the hypothesis that the population mean equals \(\mu\). If the sample mean of a simple random sample is not contained in the interval it is said that the mean is significantly different from \(\mu\) with a significance threshold of 5%. Note that the significance threshold is always a “chosen” value!
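The decision rule above can be sketched as a small helper function (the function name and the example numbers are illustrative, not taken from the text):

```python
from statistics import NormalDist

def significantly_different(m, mu, sigma, N, alpha=0.05):
    """Return True if the sample mean m falls outside the central
    (1 - alpha) acceptance region around the hypothesised mu."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / N ** 0.5
    return abs(m - mu) > half_width

# Hypothetical numbers: mu = 171.6, sigma = 10, N = 25, so the 95%
# acceptance region is roughly [167.68, 175.52].
print(significantly_different(173.0, 171.6, 10, 25))  # False: inside
print(significantly_different(166.0, 171.6, 10, 25))  # True: outside
```

Changing `alpha` implements the point made above: the significance threshold is a chosen value, not a property of the data.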
97.5 Confidence Interval of the Arithmetic Mean
The symmetric interval of population means around the sample mean (with a significance threshold of 5%) is called the 95% confidence interval of the population mean. The 95% refers to the long-run proportion of such intervals (over repeated sampling) that contain the true \(\mu\); for any one computed interval, \(\mu\) either is or is not inside it. There are several situations for which the confidence interval can be defined.
97.5.1 Normal Population (\(\mu\) is unknown and \(\sigma\) is known)
A simple random sample (of size \(N\)) is drawn from the population and the sample mean \(m\) is computed.
If \(m \in \left[ \mu - 1.96 \frac{\sigma}{\sqrt{N}}, \mu + 1.96 \frac{\sigma}{\sqrt{N}} \right]\) then \(m\) is not significantly different from \(\mu\). In other words, \(m\) is not significantly different from \(\mu\) if \(\left| m - \mu \right| \leq 1.96 \frac{\sigma}{\sqrt{N}}\).
Hence, the collection of population means for which \(m\) is not significantly different are contained in the following “confidence interval”:
\[\left[ m - 1.96 \frac{\sigma}{\sqrt{N}}, m + 1.96 \frac{\sigma}{\sqrt{N}} \right]\]
97.5.2 Asymmetric Population (\(\mu\) is unknown and \(\sigma\) is known)
A simple random sample (of size \(N\)) is drawn from this population and the sample mean \(m\) is computed.
If \(N\) is sufficiently large, the confidence interval is
\[\left[ m - 1.96 \frac{\sigma}{\sqrt{N}}, m + 1.96 \frac{\sigma}{\sqrt{N}} \right]\]
If \(N\) is relatively small, a distribution-free closed-form interval is generally not available; inference then requires stronger assumptions (e.g. approximate normality) or resampling methods.
97.5.3 Population which is not extremely asymmetric (\(\mu\) is unknown and \(\sigma\) is unknown)
A simple random sample (of size \(N\)) is drawn from this population and the sample statistics (Arithmetic Mean \(m\) and the sample standard deviation \(s\)) are computed.
If \(N\) is sufficiently large, the confidence interval is
\[\left[ m - 1.96 \frac{s}{\sqrt{N}}, m + 1.96 \frac{s}{\sqrt{N}} \right]\]
If \(N\) is relatively small, the normal-approximation interval is not reliable; use a Student-\(t\) interval under approximate normality, or use robust/bootstrapped intervals.
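The large-sample intervals of this section can be collected in one sketch. The helper name is ours; it assumes either a known \(\sigma\) (Sections 97.5.1 and 97.5.2) or an \(N\) large enough for the plug-in \(s\) to be reliable (Section 97.5.3):

```python
import random
from statistics import NormalDist, fmean, stdev

def normal_ci(sample, confidence=0.95, sigma=None):
    """Large-sample confidence interval for the population mean.
    Uses sigma when it is known; otherwise plugs in the sample sd s
    (only reliable when N is sufficiently large)."""
    N = len(sample)
    m = fmean(sample)
    scale = (sigma if sigma is not None else stdev(sample)) / N ** 0.5
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return m - z * scale, m + z * scale

# Hypothetical data; with sigma unknown this is the s-based interval
# of Section 97.5.3.
random.seed(4)
data = [random.gauss(170.4, 9.7) for _ in range(400)]
low, high = normal_ci(data)
print(low, high)
```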
97.6 Student t-distribution
97.6.1 Degrees of Freedom
To compute a 95% confidence interval around the sample mean (unless otherwise stated we always assume a simple random sample) we need information about
- \(\frac{\sigma}{\sqrt{N}}\) or \(\frac{s}{\sqrt{N}}\)
- the critical value (1.96) from the Gaussian Table (Appendix E)
As explained before, the critical value corresponds to
\[\text{P} \left( -1.96 \leq Z \leq 1.96 \right) = 95\%\]
where \(Z\) is defined as the standard normal variable
\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{N}}}\]
If the sample size \(N\) is sufficiently large, it is possible to use \(\frac{s}{\sqrt{N}}\) as an approximation of \(\frac{\sigma}{\sqrt{N}}\). In other words, \(Z = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{N}}}\) has (approximately) a standard normal distribution for which the 95% confidence interval can be found with the Gaussian Table.
If, however, \(\sigma\) is unknown and the population is exactly normal, then \(T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{N}}}\) has a Student \(t\) distribution with \(N-1\) degrees of freedom. If the population is only approximately normal, then \(T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{N}}}\) is only approximately Student \(t\) distributed (with the quality of the approximation depending on \(N\) and the degree of non-normality). As the degrees of freedom increase, the Student \(t\) distribution converges to the standard normal distribution, so the normal approximation becomes accurate for large \(N\). For non-normal populations in small samples, robustness is limited.
When dealing with small samples we need to take into account the so-called “degrees of freedom” (\(DF\)) which are loosely defined as the sample size \(N\) minus the number of parameters of the model \(K\): \(DF = N - K\).
So far, we have only considered models with just 1 parameter: the Arithmetic Mean. Hence, the value of \(K\) is equal to 1.
The degrees of freedom can be thought of as the number of observations that can cause uncertainty in our statistical model -- given that \(K = 1\) we can derive the degrees of freedom for small samples as explained in the cases below.
97.6.1.1 Case 1: \(N = 1\)
If \(N = 1 \Rightarrow DF = 0\). If we only have a single observation in our sample then the Arithmetic Mean is exactly equal to the observed value: \(\bar{X} = \frac{1}{1} \sum_{i=1}^{1} X_i = X_1\). If \(\bar{X}\) is used to predict the sample values, the prediction error \(E_1\) is zero: \(E_1 = X_1 - \bar{X} = 0\). In other words, there is no uncertainty in the statistical model.
This lack of uncertainty is also reflected by the fact that the (biased) sample variance is zero: \(V(X) = s_X^2 = \frac{1}{1} \sum_{i=1}^{1} \left( X_i - \bar{X} \right)^2 = \left( X_1 - \bar{X} \right)^2 = 0\). It is obvious that it does not make sense to study probability intervals if the variance is zero.
This may intuitively explain why the “sample variance” (i.e. the “unbiased” Variance that was defined in Section 66.6) is defined with a denominator \(N-1\) (instead of \(N\)). The reason is that we first need to compute the Arithmetic Mean based on the sample values -- the degrees of freedom which can contribute to the statistical uncertainty are equal to zero when \(N=1\). The fact that the unbiased Variance is defined with \(N-1\) in the denominator makes a lot of sense because we are never allowed to divide by zero (hence the unbiased Variance does not exist when \(N=1\)).
97.6.1.2 Case 2: \(N = 2\)
If \(N = 2 \Rightarrow DF = 1\). When two observations are available it is possible for the Variance to be meaningful as a measure of uncertainty (i.e. \(V(X) > 0\)). This is the case when \(X_1 \neq X_2\), which implies that either \(X_1 > X_2\) or \(X_1 < X_2\). In both cases the mean \(\bar{X} = \frac{X_1 + X_2}{2}\) is a perfect predictor of neither observation: \(\bar{X} \neq X_1\) and \(\bar{X} \neq X_2\), so \(E_1 = X_1 - \bar{X} \neq 0 \wedge E_2 = X_2 - \bar{X} \neq 0\).
The unbiased sample Variance is given by the formula
\[V_{sam}(X) = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2}{2-1} > 0\]
which is twice the value of the (biased) Variance formula that would be used if the sample were the entire population
\[V_{pop}(X) = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2}{2} > 0\]
The fact that \(V_{sam}(X) > V_{pop}(X)\) is consistent with stating that predictions made about the sample values (based on the Arithmetic Sample Mean) have a bigger uncertainty than predictions made about the population values (based on the Arithmetic Mean of the Population).
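The `statistics` module implements both denominators, which makes the \(N = 2\) case easy to check numerically (using the sample values \((2, 3)\) from the special case below):

```python
import statistics

# Two observations, as in the N = 2 case above.
x1, x2 = 2.0, 3.0

v_sam = statistics.variance([x1, x2])   # unbiased: divides by N - 1 = 1
v_pop = statistics.pvariance([x1, x2])  # biased: divides by N = 2

# With N = 2 the unbiased sample variance is exactly twice the biased one.
print(v_sam, v_pop)  # 0.5 0.25
```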
Consider the special case where the sample values are as follows: \((2, 3)\). In addition, suppose that, for some unknown reason, we don’t want to use the Arithmetic Mean as a predictor \(c\) for the sample values -- we choose \(c = X_1\) as a predictor instead of \(\bar{X}\). How many observations can contribute to the uncertainty of our model?
The answer is straightforward because the prediction errors are: \(E_1 = X_1 - c = X_1 - X_1 = 0\) and \(E_2 = X_2 - c = X_2 - X_1 \neq 0\) (because we assumed that \(X_1 \neq X_2\)).
In other words, there is at least 1 observation which can contribute to the prediction uncertainty.
97.6.1.3 Case 3: \(N = 3\)
If \(N = 3 \Rightarrow DF = 2\). When three observations are available, it is likely that the Variance represents a meaningful measure of uncertainty (we assume pairwise distinct observations: \(X_1 \neq X_2\), \(X_1 \neq X_3\) and \(X_2 \neq X_3\)). This assumption implies that \(\bar{X}\) is not a perfect predictor of the observations, which leads to uncertainty and a positive sample variance \(V_{sam}(X) > 0\).
Consider the special case where the sample values are \(\left( 3, 5, 4 \right)\). In this case the sample mean is exactly equal to the last observation: \(\bar{X} = X_3\) which means that there are only two observations that can contribute to the prediction uncertainty (hence \(DF = 2\)).
In other words, there are at least 2 observations which can contribute to the prediction uncertainty.
97.6.1.4 Case 4: \(N > 3\)
As before, similar arguments can be made to explain the degrees of freedom that contribute to the prediction uncertainty. As \(N\) becomes larger (and provided that \(K\) is small) the degrees of freedom become less important.
The Student \(t\)-distribution is suited for small sample sizes and explicitly takes into account the degrees of freedom. Therefore it makes sense to use the \(t\)-distribution when \(N\) is not large.
However, the question remains: what happens when \(N\) is large and we still use the \(t\)-distribution? The answer can be found in Appendix F, which contains the critical values for the \(t_{1-\alpha,\nu}\) distribution (\(\alpha\) is the significance level, and \(\nu\) represents the degrees of freedom).
Consider the case where we are interested in computing a 95% confidence interval around the sample mean \(m\). We need to find the critical value \(t_{0.95,\nu=N-1}\) in the \(t\)-table of Appendix F such that
\[\text{P} \left( -t_{0.95,N-1} \leq t \leq t_{0.95,N-1} \right) = 0.95\]
from which we derive the confidence interval:
\[\left[ m - t_{0.95,N-1} \frac{s}{\sqrt{N}}, m + t_{0.95,N-1} \frac{s}{\sqrt{N}} \right]\]
First we draw a simple random sample with sample size \(N = 25\). It follows that
\[t = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{25}}}\]
has a Student \(t\)-distribution with \(\nu = 25 -1 = 24\) degrees of freedom. The critical value in the t-table is found in the cell which corresponds to column for \(1 - \frac{\alpha}{2} = 0.975\) (the \(t\)-distribution is symmetric) and the row for \(\nu = 24\): \(t = 2.064\).
Suppose that \(m = 171.6\) and \(s = 9.6\) then it follows that the 95% confidence interval is
\[171.6 \pm 2.064 \frac{9.6}{5} \simeq 171.6 \pm 3.96 \simeq [167.64, 175.56]\]
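With the tabulated critical value \(t_{0.975, 24} = 2.064\), this interval can be computed in a few lines. The table value is hard-coded because the Python standard library has no \(t\)-quantile function:

```python
# 95% t-interval for the mean with N = 25, using the critical value
# t read from the t-table (Appendix F) for 1 - alpha/2 = 0.975, nu = 24.
m, s, N = 171.6, 9.6, 25
t_crit = 2.064

half_width = t_crit * s / N ** 0.5    # 2.064 * 9.6 / 5 = 3.96288
low, high = m - half_width, m + half_width
print(round(low, 2), round(high, 2))  # 167.64 175.56
```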
How would the confidence interval change if \(N\) increases? To answer this question, we examine the critical values from \(\nu = 24\) up to \(\nu = \infty\) in the column which corresponds to \(1 - \frac{\alpha}{2} = 0.975\). The \(t\)-values converge towards \(t = 1.960\) (see Appendix F). If we compare this with the critical value that corresponds to \(0.95 / 2 = 0.475\) in the Gaussian Table (Appendix E) we observe that they are identical!
The conclusion is that the \(t\)-distribution converges to the (Gaussian) Normal distribution for \(N \rightarrow \infty\). Now we can answer the question of what happens when \(N\) is large and we still use the \(t\)-distribution. The answer is: we can always use the \(t\)-distribution (even for large samples) because when \(N\) becomes large, the \(t\)-distribution converges to a Normal distribution.