96 The Population
96.1 Introduction
Statistical reasoning often involves the concept of a population, which is simply the collection of all subjects that are of interest to the researcher. For example, the students who are enrolled in a statistics course could be defined as a population of interest (assume that there are \(M = 1000\) individuals).
For each individual we measure a property (e.g. the student’s length in cm). The collected measurements are denoted by \(l_i'\) (i.e. \(l_1', l_2', l_3', \ldots, l_M'\), where \(l_M' = l_{1000}'\)). The dataset is presented in the Data tab shown below:
The histogram and the associated frequencies can be computed by using the R module shown below and selecting the “Student lengths” dataset.
The histogram counts cited below assume the settings shown above (10 bins and the default range). The histogram shows that there are many students with a length between 160 cm and 180 cm (more precisely: there are \(151+189+175+178=693\) students in the interval \(]160,180]\)). This is approximately 69% of the population.
96.1.1 Parameters \(\mu\) and \(\sigma^2\)
There are two fundamental properties which can be used to describe the length of all students in our population:
- the Arithmetic Mean \(\mu = \frac{1}{M} \sum_{i=1}^{M} l_i' = 170.5643\) (check this with the Central Tendency module)
- the Variance \(\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} \left( l_i' - \mu \right)^2 = 95.90\) (check this with the Variability module)
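As a minimal sketch of the two formulas, the Python snippet below computes \(\mu\) and \(\sigma^2\) for a small hypothetical list of lengths. The five values are made up for illustration and are not the “Student lengths” dataset; the function names `population_mean` and `population_variance` are our own. Note that the variance divides by \(M\), exactly as in the definition above.

```python
from statistics import fmean, pvariance


def population_mean(values):
    """Arithmetic mean: the sum of all values divided by M."""
    return sum(values) / len(values)


def population_variance(values):
    """Population variance: mean squared deviation from mu (divides by M)."""
    mu = population_mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


# Hypothetical lengths in cm (illustration only, not the course dataset)
lengths = [160.0, 165.0, 170.0, 175.0, 180.0]

mu = population_mean(lengths)
sigma2 = population_variance(lengths)

# Cross-check against the standard library's population versions
assert abs(mu - fmean(lengths)) < 1e-9
assert abs(sigma2 - pvariance(lengths)) < 1e-9

print(f"mu = {mu}, sigma^2 = {sigma2}")
```

Applied to the full dataset of 1000 lengths, the same two functions should reproduce the values reported by the Central Tendency and Variability modules.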
The interpretation of \(\mu\) is rather obvious. The parameter \(\sigma^2\), however, has an interpretation that is much less intuitive. For this reason, we investigate the number of students contained in the following intervals, which are expressed in terms of the standard deviation \(\sigma = \sqrt{\sigma^2}\). (The variance is also written \(\mu_2\), the second central moment, so \(\sigma^2=\mu_2\) and \(\sigma=\sqrt{\mu_2}\).)
- \(]\mu - 1 \sigma,\mu + 1 \sigma]\) or \(]170.5643 - \sqrt{95.9}, 170.5643 + \sqrt{95.9}]\)
- \(]\mu - 2 \sigma,\mu + 2 \sigma]\) or \(]170.5643 - 2\sqrt{95.9}, 170.5643 + 2\sqrt{95.9}]\)
- \(]\mu - 3 \sigma,\mu + 3 \sigma]\) or \(]170.5643 - 3\sqrt{95.9}, 170.5643 + 3\sqrt{95.9}]\)
We can estimate the number of students in each interval from the frequencies of the histogram. The estimate is approximate because the interval bounds do not coincide exactly with the bin edges:
- \(]170.5643 - \sqrt{95.9}, 170.5643 + \sqrt{95.9}] \simeq ]160.77,180.36]\) which (approximately) contains the lengths of \(151+189+175+178=693\) students (= 69.3%)
- \(]170.5643 - 2\sqrt{95.9}, 170.5643 + 2\sqrt{95.9}] \simeq ]150.98,190.15]\) which (approximately) contains the lengths of \(39+87+693+99+45=963\) students (= 96.3%)
- \(]170.5643 - 3\sqrt{95.9}, 170.5643 + 3\sqrt{95.9}] \simeq ]141.19,199.94]\) which (approximately) contains the lengths of \(3+12+963+15+6=999\) students (= 99.9%)
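The arithmetic in this list can be reproduced with a short Python sketch. The counts are the ones quoted above; grouping them into 5 cm bins between 140 cm and 200 cm is an assumption inferred from the sums, not something read from the dataset itself, and the variable names are our own.

```python
import math

mu = 170.5643
sigma = math.sqrt(95.9)
M = 1000  # population size

# Histogram counts quoted in the text, grouped by the range they cover.
# ASSUMPTION: bins are 5 cm wide (inferred from the sums in the text).
core = [151, 189, 175, 178]   # ]160, 180]
ring2 = [39, 87, 99, 45]      # ]150, 160] and ]180, 190]
ring3 = [3, 12, 15, 6]        # ]140, 150] and ]190, 200]

# Approximate number of students within t standard deviations of the mean
within = {
    1: sum(core),
    2: sum(core) + sum(ring2),
    3: sum(core) + sum(ring2) + sum(ring3),
}

for t in (1, 2, 3):
    lo, hi = mu - t * sigma, mu + t * sigma
    share = 100 * within[t] / M
    print(f"t={t}: ]{lo:.2f}, {hi:.2f}] -> {within[t]} students ({share:.1f}%)")
```

The counts are approximate because each interval is rounded to the nearest bin edges (for example, \(]160.77,180.36]\) is treated as \(]160,180]\)).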
Observe that these results are very close to what one would predict if student lengths were normally distributed with mean \(\mu = 170.5643\) and variance \(\sigma^2 = \mu_2 = 95.9\): a normal distribution places about 68.3%, 95.4%, and 99.7% of its probability within one, two, and three standard deviations of the mean, respectively. Is this just a coincidence?
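The proportion of a normal distribution that lies within \(t\) standard deviations of its mean does not depend on \(\mu\) or \(\sigma\) and can be written with the error function as \(\text{erf}(t/\sqrt{2})\). The sketch below (the function name `normal_coverage` is our own) places these values next to the histogram-based proportions found above.

```python
import math


def normal_coverage(t):
    """P(mu - t*sigma <= X <= mu + t*sigma) for any normally distributed X."""
    return math.erf(t / math.sqrt(2))


# Proportions obtained from the histogram frequencies in the text
observed = {1: 0.693, 2: 0.963, 3: 0.999}

for t, share in observed.items():
    print(f"t={t}: histogram {share:.1%}, normal model {normal_coverage(t):.2%}")
```

The differences are about one percentage point or less, which is what motivates the normal model in the next section.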
96.2 Using the Normal Model
In the previous section we described the population of students based on the following:
- a frequency table
- a histogram
- the Arithmetic Mean \(\mu = 170.5643\)
- the Standard Deviation \(\sigma = \sqrt{95.9}\)
It is, however, not necessary to add up the frequencies of the histogram to obtain the proportions of students in the intervals from the previous section. The histogram is merely a graphical illustration of how student lengths are distributed among the individuals of the population, and its shape can be approximated reasonably well by a Gaussian (i.e. Normal) curve with \(\mu = 170.5643\) and \(\sigma^2 = 95.9\). It is therefore fair to model the student lengths with a normal distribution. Hint: check this with the ML Fitting module.
In what follows, we switch from the finite observed values \(l_i'\) to a probabilistic model and use \(l\) for the (continuous) length of a randomly selected student.
Hence, we treat the variable \(l\) as approximately normally distributed with E\((l) = 170.5643\) (= Arithmetic Mean of the Population) and \(\mu_2 = 95.9\) (= Variance of the Population), which leads to the following mathematical model for any \(t \geq 0\):
\[ \text{P} \left( \text{E} \left( l \right) - t \sqrt{\mu_2} \leq l \leq \text{E} \left( l \right) + t \sqrt{\mu_2} \right) = \int_{\text{E}(l) - t \sqrt{\mu_2} }^{\text{E}(l) + t \sqrt{\mu_2} } \frac {1}{\sqrt{\mu_2}\sqrt{2\pi }}e^{-\frac {1} {2} \left(\frac{l- \text{E}(l)} {\sqrt{\mu_2}}\right)^2} \, dl \]
or
\[ \begin{gather*}\text{P} \left( 170.5643 - t \sqrt{95.9} \leq l \leq 170.5643 + t \sqrt{95.9} \right) \\= \int_{170.5643 - t \sqrt{95.9} }^{170.5643 + t \sqrt{95.9} } \frac {1}{\sqrt{95.9}\sqrt{2\pi }}e^{-\frac {1} {2} \left(\frac{l-170.5643} {\sqrt{95.9}}\right)^2} \, dl\end{gather*} \]
This implies that the mathematical model can be used to make (approximate) predictions about the frequencies of lengths for any interval that is considered (compare the results from Section 96.1.1 with the results from Section 95.2.2).
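As an illustration, the integral in the model can be evaluated with the cumulative distribution function of the normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `interval_probability` and `symmetric_probability` are our own.

```python
from statistics import NormalDist

# Normal model with the population mean and (biased) standard deviation
model = NormalDist(mu=170.5643, sigma=95.9 ** 0.5)


def interval_probability(lower, upper):
    """P(lower <= l <= upper): the integral of the normal density over the interval."""
    return model.cdf(upper) - model.cdf(lower)


def symmetric_probability(t):
    """P(E(l) - t*sqrt(mu_2) <= l <= E(l) + t*sqrt(mu_2))."""
    return interval_probability(model.mean - t * model.stdev,
                                model.mean + t * model.stdev)


# Model-based prediction for the interval read from the histogram
p = interval_probability(160, 180)
print(f"P(160 <= l <= 180) = {p:.4f}, i.e. about {1000 * p:.0f} of 1000 students")

for t in (1, 2, 3):
    print(f"t={t}: {symmetric_probability(t):.4f}")
```

For the interval from 160 cm to 180 cm the model predicts a proportion of roughly 0.69, close to the 693 students counted in the histogram.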
Note that the parameters used in the normal probability density function are the Arithmetic Mean (Section 65.2) and the biased Standard Deviation (Section 66.7), i.e. the version that divides by \(M\) rather than \(M-1\). The mathematical model should (generally speaking) not be used with other measures of central tendency or variability.
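To see how much the choice of divisor matters here, the following sketch converts the biased variance from the text into the value that the \(M-1\) version would give. It relies only on the identity that the two versions differ by the factor \(M/(M-1)\); the variable names are our own.

```python
import math

M = 1000                  # population size
biased_variance = 95.9    # divides by M (value from the text)

# Both versions share the same sum of squared deviations,
# so they differ only by the factor M / (M - 1).
unbiased_variance = biased_variance * M / (M - 1)

print(f"biased:   variance = {biased_variance:.4f}, sd = {math.sqrt(biased_variance):.4f}")
print(f"unbiased: variance = {unbiased_variance:.4f}, sd = {math.sqrt(unbiased_variance):.4f}")
```

With \(M = 1000\) the two versions differ by only about 0.1%, but the gap grows as the number of observations shrinks.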
In practice, researchers use various wordings for this model (which all have the same meaning). Here are a few examples:
- the length of students is normally distributed with a mean of 170.5643 and a variance of 95.9
- student lengths have a normal distribution with a mean of 170.5643 and a variance of 95.9
- the population is normal with \(\mu = 170.5643\) and \(\sigma = 9.79\)
- etc.