9 Naive Bayes Classifier
9.1 Introduction
When scientists talk or write about the “Naive Bayes” method they usually refer to a family of classifiers that apply Bayes’ Theorem under conditional-independence assumptions. The worked example below is a Bernoulli Naive Bayes classifier because it uses document-level presence/absence of word combinations. The term “Naive” refers to the restrictive conditional-independence assumptions that the classifier makes: they keep the computations simple, but they might be perceived as unrealistic in some situations.
For the general Bayesian decision framing (posterior thresholds, Bayes factors, and decision-risk interpretation), see Chapter 113.
9.2 Example
To illustrate the use of a Bernoulli Naive Bayes Classifier, consider the problem of building a simple model that discriminates between real and fake news. If a sufficiently large collection of news articles is available and if we know which articles belong to the real or fake category, then it is possible to extract (not all, but a smart selection of) word combinations that are used within the same item.
For instance, the occurrence of the word combination “royal family” in an article could be investigated to determine the conditional probabilities1 of occurring in real and in fake news. If this word combination occurs in 13 real news articles (out of a total of 1000 randomly selected real articles), then the Likelihood would be \(\text{P} (\text{royal family} | real) = 13 / 1000\). Likewise, if the word combination were found in 87 fake news articles (out of a total of 1000 randomly selected fake news items), then the Likelihood would be \(\text{P} (\text{royal family} | fake) = 87 / 1000\).
Similarly, we can compute the Likelihoods for a wide variety of interesting word combinations and for varying word counts per combination. A few examples of these Likelihoods are shown in Table 9.1 which contains a few word combinations that were found in selected articles and the Likelihoods of finding each combination in real and in fake news.
| Word Count | Word Combination | \(\text{P} (combination | real)\) | \(\text{P} (combination | fake)\) |
|---|---|---|---|
| 1 | royal | 0.081 | 0.092 |
| 1 | family | 0.281 | 0.107 |
| 2 | royal family | 0.013 | 0.087 |
| 1 | shocking | 0.043 | 0.269 |
| 1 | horror | 0.013 | 0.311 |
| 2 | secret wedding | 0.093 | 0.039 |
| 2 | Naive Bayes | 0.001 | 0 |
To classify an entirely new article that contains the word combinations “shocking” and “royal family”, the unconditional (prior) probability of observing a real news item, \(\text{P} (real)\), can be used as a starting point. While this probability could be based on any guess made by the researcher (or by an external expert who is consulted), it is often computed as the ratio of the number of real news items to the total number of observed items (a so-called “data-based prior”).
In this case, the news items were randomly obtained from a very large database which contains (almost) all news items that were published in a year. Therefore, it is acceptable to compute the data-based prior probability as \(\text{P} (real) = 736 / 1000 = 0.736\) (because there are 736 real news items in the collection of 1000 randomly selected articles)2.
To obtain a posterior probability for the new article, the prior probability must be multiplied by the Likelihoods that correspond to the observed word combinations. In this case
\[ \begin{gather*} \text{P}(real | \text{shocking royal family}) \propto \text{P} (real) \times \text{P} (\text{shocking} | real) \times \text{P} (\text{royal family} | real) = \\ 0.736 \times 0.043 \times 0.013 \simeq 0.00041142 \end{gather*} \]
and
\[ \begin{gather*} \text{P}(fake | \text{shocking royal family}) \propto \text{P} (fake) \times \text{P} (\text{shocking} | fake) \times \text{P} (\text{royal family} | fake) = \\ (1 - 0.736) \times 0.269 \times 0.087 \simeq 0.00617839 \end{gather*} \]
Since the posterior probabilities are proportional to the computed scores, and since the score for fake news is greater than the score for real news, it is predicted that the newly observed news item is fake. Of course, it is also possible to use the odds formula (Equation 7.4) so that the actual probabilities can be computed:
\(\text{P}(real | \text{shocking royal family}) = \frac{0.00041142}{0.00041142+0.00617839} \simeq 0.0624\)
and
\(\text{P}(fake | \text{shocking royal family}) = \frac{0.00617839}{0.00041142+0.00617839} \simeq 0.9376\)
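The computation above can be sketched in a few lines of Python; the priors and Likelihoods below are the illustrative figures from Table 9.1, not real data:

```python
# Illustrative figures from the worked example (Table 9.1); not real data.
PRIOR_REAL = 0.736
LIKELIHOODS = {
    "real": {"shocking": 0.043, "royal family": 0.013},
    "fake": {"shocking": 0.269, "royal family": 0.087},
}

def posterior(words, prior_real=PRIOR_REAL, likelihoods=LIKELIHOODS):
    """Score each class as prior times the product of the word Likelihoods,
    then normalise the scores so that they sum to one."""
    scores = {}
    for label in ("real", "fake"):
        score = prior_real if label == "real" else 1 - prior_real
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

post = posterior(["shocking", "royal family"])
# post["real"] is roughly 0.0624 and post["fake"] roughly 0.9376,
# matching the odds computation in the text.
```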
9.3 Interaction Effects
Table 9.1 contains separate Likelihoods for the words “royal”, “family”, and the word combination “royal family”. The Likelihoods \(\text{P} (\text{royal} | real)\) and \(\text{P} (\text{family} | real)\), however, also count the articles that contain both words, so both are larger than \(\text{P} (\text{royal family} | real)\). The same can be said about the Likelihoods for fake news items: \(\text{P} (\text{royal} | fake) \geq \text{P} (\text{royal family} | fake)\) and \(\text{P} (\text{family} | fake) \geq \text{P} (\text{royal family} | fake)\)3.
The presence of the words “royal” and “family” in a news article has a quite different effect when both are present at the same time than when only one of them is observed. In other words, the effects of these words interact with each other. This can be illustrated by using the Likelihoods of each individual word, rather than the Likelihood of the interaction effect:
\[ \begin{gather*} \text{P}(real | \text{shocking royal family}) \propto \text{P} (real) \times \text{P} (\text{shocking} | real) \times \text{P} (\text{royal} | real) \times \text{P} (\text{family} | real) = \\ 0.736 \times 0.043 \times 0.081 \times 0.281 \simeq 0.0007203401 \end{gather*} \]
and
\[ \begin{gather*} \text{P}(fake | \text{shocking royal family}) \propto \text{P} (fake) \times \text{P} (\text{shocking} | fake) \times \text{P} (\text{royal} | fake) \times \text{P} (\text{family} | fake) = \\ (1 - 0.736) \times 0.269 \times 0.092 \times 0.107 \simeq 0.0006990815 \end{gather*} \]
The corresponding probabilities are:
\(\text{P}(real | \text{shocking royal family}) = \frac{0.0007203401}{0.0007203401+0.0006990815} \simeq 0.507\)
and
\(\text{P}(fake | \text{shocking royal family}) = \frac{0.0006990815}{0.0007203401+0.0006990815} \simeq 0.493\)
The probabilities have changed considerably, and the conclusion differs from the previous computation. Of course, the first conclusion (i.e. that the article is fake) is the correct one: we must use the proper Likelihoods that take the interaction effects into account, as was done in the first computation. Neglecting interaction effects can lead to wrong conclusions, and this is exactly what the Naive Bayes method does under its conditional-independence assumption, unless the researcher explicitly includes the relevant interaction effects as features.
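The difference between the two computations can be reproduced directly; the numbers are again the illustrative Likelihoods from Table 9.1:

```python
# Illustrative priors and Likelihoods from Table 9.1; not real data.
prior_real, prior_fake = 0.736, 0.264

# "royal family" treated as a single feature (captures the interaction):
joint_real = prior_real * 0.043 * 0.013
joint_fake = prior_fake * 0.269 * 0.087

# "royal" and "family" treated as independent words (the naive assumption):
naive_real = prior_real * 0.043 * 0.081 * 0.281
naive_fake = prior_fake * 0.269 * 0.092 * 0.107

p_fake_joint = joint_fake / (joint_real + joint_fake)  # about 0.938
p_fake_naive = naive_fake / (naive_real + naive_fake)  # about 0.493
```

The verdict flips from a confident “fake” to a slight preference for “real” purely because the interaction between “royal” and “family” was ignored.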
9.4 Zero Probabilities
Consider the case where a newly observed news article contains the word combination “Naive Bayes” together with the words “shocking” and “horror”. At first sight, one would expect the article to be fake because it contains words that are associated with sensationalism. The estimated Likelihood of observing the word combination “Naive Bayes” in a fake news article, however, is zero (i.e. \(\text{P} (\text{Naive Bayes} | fake) = 0\)), because no such article was observed in the collection of 1000 news items that was used to determine the Likelihoods in Table 9.1. No matter how many sensational words the article contains, a Naive Bayes Classifier based on Table 9.1 will therefore always predict the news item to be real instead of fake.
The true probability of observing the words “Naive Bayes” in a fake news article is not exactly zero but slightly larger, and the same can be said about any other word combination of interest. Therefore, it is common practice to add a smoothing parameter \(\alpha \in \mathbb{R}_{\geq 0}\) to the frequency of articles in the real and fake categories; in practice \(\alpha > 0\) is used (e.g. Laplace smoothing with \(\alpha = 1\)). A standard add-\(\alpha\) smoothing formula is often written as
\[ \text{P}(\text{word} | \text{class}) = \frac{\text{count}(\text{word}, \text{class}) + \alpha}{\text{total}_{\text{class}} + \alpha |V|} \]
where \(|V|\) denotes the vocabulary size. Table 9.2 uses a simplified row-wise illustration: each class contains 1000 articles, each listed combination is treated as its own single-entry vocabulary (\(|V| = 1\)), and the results are rounded to three decimals, so each displayed entry effectively equals \((\text{count}+\alpha)/1000\) within its class. In this way, Table 9.1 can be recomputed for various values of \(\alpha\), as shown in Table 9.2.
| Combination | \(\text{P} (comb | real)\) \(\alpha = 0\) | \(\text{P} (comb | fake)\) \(\alpha = 0\) | \(\text{P} (comb | real)\) \(\alpha = 1\) | \(\text{P} (comb | fake)\) \(\alpha = 1\) | \(\text{P} (comb | real)\) \(\alpha = 2\) | \(\text{P} (comb | fake)\) \(\alpha = 2\) |
|---|---|---|---|---|---|---|
| royal | 0.081 | 0.092 | 0.082 | 0.093 | 0.083 | 0.094 |
| family | 0.281 | 0.107 | 0.282 | 0.108 | 0.283 | 0.109 |
| royal family | 0.013 | 0.087 | 0.014 | 0.088 | 0.015 | 0.089 |
| shocking | 0.043 | 0.269 | 0.044 | 0.270 | 0.045 | 0.271 |
| horror | 0.013 | 0.311 | 0.014 | 0.312 | 0.015 | 0.313 |
| secret wedding | 0.093 | 0.039 | 0.094 | 0.040 | 0.095 | 0.041 |
| Naive Bayes | 0.001 | 0 | 0.002 | 0.001 | 0.003 | 0.002 |
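The add-\(\alpha\) recipe can be written as a small helper function; this is a minimal sketch that reproduces the rounded entries of Table 9.2 under the chapter's simplification of 1000 articles per class and a single-entry vocabulary per row:

```python
def smoothed_likelihood(count, total, alpha, vocab_size=1):
    """Add-alpha (Lidstone) smoothing; alpha = 1 is Laplace smoothing.
    The tables in this chapter use vocab_size = 1 and round to three
    decimals, so the entries effectively reduce to (count + alpha) / total."""
    return (count + alpha) / (total + alpha * vocab_size)

# "Naive Bayes" was never seen in the 1000 fake articles, yet with
# alpha = 1 its Likelihood becomes a small positive number instead of zero:
p_unseen = smoothed_likelihood(0, 1000, alpha=1)
```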
If \(\alpha = 1\) the probability scores can be computed as follows:
\[ \begin{gather*} \text{P}(real | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (real) \times \text{P} (\text{Naive Bayes}_{\alpha = 1} | real) \times \text{P} (\text{shocking}_{\alpha = 1} | real) \times \text{P} (\text{horror}_{\alpha = 1} | real) = \\ 0.736 \times 0.002 \times 0.044 \times 0.014 \simeq 9.06752 \times 10^{-7} \end{gather*} \]
and
\[ \begin{gather*} \text{P}(fake | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (fake) \times \text{P} (\text{Naive Bayes}_{\alpha = 1} | fake) \times \text{P} (\text{shocking}_{\alpha = 1}| fake) \times \text{P} (\text{horror}_{\alpha = 1} | fake) = \\ (1 - 0.736) \times 0.001 \times 0.270 \times 0.312 \simeq 2.223936 \times 10^{-5} \end{gather*} \]
If \(\alpha = 2\) the following scores are obtained:
\[ \begin{gather*} \text{P}(real | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (real) \times \text{P} (\text{Naive Bayes}_{\alpha = 2} | real) \times \text{P} (\text{shocking}_{\alpha = 2} | real) \times \text{P} (\text{horror}_{\alpha = 2} | real) = \\ 0.736 \times 0.003 \times 0.045 \times 0.015 \simeq 1.4904 \times 10^{-6} \end{gather*} \]
and
\[ \begin{gather*} \text{P}(fake | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (fake) \times \text{P} (\text{Naive Bayes}_{\alpha = 2} | fake) \times \text{P} (\text{shocking}_{\alpha = 2}| fake) \times \text{P} (\text{horror}_{\alpha = 2} | fake) = \\ (1 - 0.736) \times 0.002 \times 0.271 \times 0.313 \simeq 4.478654 \times 10^{-5} \end{gather*} \]
In both cases the conclusion is that the newly observed article is fake, so it does not matter here whether \(\alpha\) is chosen to be 1 or 2. In general, however, the choice of \(\alpha\) is determined through a process called “parameter optimisation”, in which a computer algorithm evaluates various choices automatically.
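The insensitivity of the verdict to the choice between \(\alpha = 1\) and \(\alpha = 2\) can be checked with the rounded Likelihoods from Table 9.2 (illustrative values only):

```python
from math import prod

# Rounded smoothed Likelihoods for ("Naive Bayes", "shocking", "horror")
# taken from Table 9.2; illustrative values, not real data.
smoothed = {
    1: {"real": (0.002, 0.044, 0.014), "fake": (0.001, 0.270, 0.312)},
    2: {"real": (0.003, 0.045, 0.015), "fake": (0.002, 0.271, 0.313)},
}
prior = {"real": 0.736, "fake": 0.264}

verdicts = {}
for alpha, lks in smoothed.items():
    scores = {c: prior[c] * prod(lks[c]) for c in ("real", "fake")}
    verdicts[alpha] = max(scores, key=scores.get)
# verdicts == {1: "fake", 2: "fake"}: the conclusion is the same for both.
```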
9.5 Types of Naive Bayes Classifiers
In the presentation of the Naive Bayes Classifier above, it was implicitly assumed that the Likelihood statistics are determined by counting the number of cases (i.e. news articles) that contain a certain word (or word combination) and dividing by the total number of cases that were screened for each category (i.e. real or fake). This is the Bernoulli Naive Bayes setup. In other words, the underlying dataset that is used to obtain the Likelihoods would look similar to Table 9.3, where a value of one indicates that the word is present and a zero that it is not.
| label | royal family | shocking | horror | Naive Bayes | … |
|---|---|---|---|---|---|
| real | 0 | 0 | 0 | 0 | … |
| real | 0 | 0 | 0 | 1 | … |
| fake | 1 | 1 | 0 | 0 | … |
| real | 0 | 0 | 0 | 0 | … |
| fake | 1 | 1 | 1 | 0 | … |
| fake | 1 | 0 | 1 | 0 | … |
| fake | 0 | 1 | 0 | 0 | … |
It is, however, also possible to use word counts (within each article) instead of a binary variable; this count-based variant is commonly called a Multinomial Naive Bayes Classifier. The underlying dataset would then be similar to Table 9.4.
| label | royal family | shocking | horror | Naive Bayes | … |
|---|---|---|---|---|---|
| real | 0 | 0 | 0 | 0 | … |
| real | 0 | 0 | 0 | 4 | … |
| fake | 18 | 6 | 0 | 0 | … |
| real | 0 | 0 | 0 | 0 | … |
| fake | 3 | 5 | 11 | 0 | … |
| fake | 9 | 0 | 7 | 0 | … |
| fake | 0 | 8 | 0 | 0 | … |
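The relation between the two table layouts is a simple binarisation: a Bernoulli model only needs presence/absence, so the counts of Table 9.4 collapse to the zeros and ones of Table 9.3. A minimal sketch with the rows shown above:

```python
# Word-count rows mirroring Table 9.4 (columns: royal family, shocking,
# horror, Naive Bayes); the data are made up for illustration.
counts = [
    ("real", [0, 0, 0, 0]),
    ("real", [0, 0, 0, 4]),
    ("fake", [18, 6, 0, 0]),
    ("real", [0, 0, 0, 0]),
    ("fake", [3, 5, 11, 0]),
    ("fake", [9, 0, 7, 0]),
    ("fake", [0, 8, 0, 0]),
]

# Collapsing counts to presence/absence yields the rows of Table 9.3:
binary = [(label, [1 if c > 0 else 0 for c in row]) for label, row in counts]
```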
Other types of Naive Bayes Classifiers can be designed by using word combinations in which the order of the words is taken into account when computing the Likelihoods (rather than treating documents as a mere “bag of words” without order or grammar). There are, of course, numerous other ways in which the Likelihood table can be constructed. In practice, the proper selection of predictors (i.e. columns in the dataset) often turns out to be the key to making the Naive Bayes Classifier successful. These predictors are also called “features”, and the process of preparing them (i.e. selecting and computing the measurements to be used in the prediction model) is referred to as “feature engineering” in the Artificial Intelligence literature.
In accordance with Bayes’ Theorem of Equation 7.4, these are also called Likelihoods.↩︎
If we had hand-picked, say, 500 true and 500 false news items, then it would be better to rely on the judgement of experts to elicit a “subjective” prior probability, which would likely be much closer to the real unconditional probability that a news item is real.↩︎
This is a direct consequence of Theorem C because if \(\text{P}(B | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cap \neg C | A)\) then \(\text{P}(B | A) \geq \text{ P}(B \cap C | A)\).↩︎