Table of contents

  • 9.1 Introduction
  • 9.2 Example
  • 9.3 Interaction Effects
  • 9.4 Zero Probabilities
  • 9.5 Types of Naive Bayes Classifiers

9  Naive Bayes Classifier

9.1 Introduction

When scientists talk or write about the “Naive Bayes” method, they usually refer to a family of classifiers that apply Bayes’ Theorem under conditional-independence assumptions. The worked example below is a Bernoulli Naive Bayes classifier because it uses document-level presence/absence of word combinations. The term “Naive” refers to the restrictive conditional-independence assumptions, which may be unrealistic in some situations but which make the classifier very easy to compute.

For the general Bayesian decision framing (posterior thresholds, Bayes factors, and decision-risk interpretation), see Chapter 113.

9.2 Example

To illustrate the use of a Bernoulli Naive Bayes Classifier, consider the problem of building a simple model that discriminates between real and fake news. If a sufficiently large collection of news articles is available and if we know which articles belong to the real or fake category, then it is possible to extract (not all, but a smart selection of) word combinations that are used within the same item.

For instance, the occurrence of the word combination “royal family” in an article could be investigated to determine the conditional probabilities1 of occurring in real and in fake news. If this word combination occurs in 13 real news articles (out of a total of 1000 randomly selected real articles), then the Likelihood would be \(\text{P} (\text{royal family} | real) = 13 / 1000\). Likewise, if the word combination were found in 87 fake news articles (out of a total of 1000 randomly selected fake news items), then the Likelihood would be \(\text{P} (\text{royal family} | fake) = 87 / 1000\).

Similarly, we can compute the Likelihoods for a wide variety of interesting word combinations and for varying word counts per combination. A few examples of these Likelihoods are shown in Table 9.1 which contains a few word combinations that were found in selected articles and the Likelihoods of finding each combination in real and in fake news.

Table 9.1: Likelihoods of Word Combinations in Real and Fake News Articles (Examples)
Word Count | Word Combination | \(\text{P} (combination | real)\) | \(\text{P} (combination | fake)\)
1 | royal | 0.081 | 0.092
1 | family | 0.281 | 0.107
2 | royal family | 0.013 | 0.087
1 | shocking | 0.043 | 0.269
1 | horror | 0.013 | 0.311
2 | secret wedding | 0.093 | 0.039
2 | Naive Bayes | 0.001 | 0

To classify an entirely new article that contains the word combinations “shocking” and “royal family”, the unconditional (prior) probability of observing a real news item, \(\text{P} (real)\), can be used as a starting point. While this probability could be based on any guess made by the researcher (or by an external expert who is consulted), it is often computed as the number of real news items divided by the total number of observed items (a so-called “data-based prior”).

In this case, the news items were randomly obtained from a very large database which contains (almost) all news items that were published in a year. Therefore, it is acceptable to compute the data-based prior probability as \(\text{P} (real) = 736 / 1000 = 0.736\) (because there are 736 real news items in the collection of 1000 randomly selected articles)2.

To obtain a posterior probability for the new article, the prior probability must be multiplied by the likelihoods that correspond to the observed word combinations. In this case

\[ \begin{gather*} \text{P}(real | \text{shocking royal family}) \propto \text{P} (real) \times \text{P} (\text{shocking} | real) \times \text{P} (\text{royal family} | real) = \\ 0.736 \times 0.043 \times 0.013 \simeq 0.00041142 \end{gather*} \]

and

\[ \begin{gather*} \text{P}(fake | \text{shocking royal family}) \propto \text{P} (fake) \times \text{P} (\text{shocking} | fake) \times \text{P} (\text{royal family} | fake) = \\ (1 - 0.736) \times 0.269 \times 0.087 \simeq 0.00617839 \end{gather*} \]

Since the posterior probabilities are proportional to the computed scores, and since the score for fake news is greater than the score for real news, the newly observed news item is predicted to be fake. Of course, it is also possible to use the odds formula (Equation 7.4) to compute the actual probabilities:

\(\text{P}(real | \text{shocking royal family}) = \frac{0.00041142}{0.00041142+0.00617839} \simeq 0.0624\)

and

\(\text{P}(fake | \text{shocking royal family}) = \frac{0.00617839}{0.00041142+0.00617839} \simeq 0.9376\)
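These computations are easy to reproduce in a few lines of Python (a minimal sketch; the priors and Likelihoods are taken from Table 9.1 and the text above, and the variable names are illustrative):

```python
# Unnormalized Naive Bayes scores for an article containing
# "shocking" and "royal family" (Likelihoods from Table 9.1).
prior_real = 0.736

score_real = prior_real * 0.043 * 0.013        # P(real) * P(shocking|real) * P(royal family|real)
score_fake = (1 - prior_real) * 0.269 * 0.087  # P(fake) * P(shocking|fake) * P(royal family|fake)

# Normalize the scores so they sum to one (the odds formula of Equation 7.4).
total = score_real + score_fake
p_real = score_real / total
p_fake = score_fake / total

print(round(p_real, 4), round(p_fake, 4))  # 0.0624 0.9376
```

Because only the comparison between the two scores matters for the predicted class, the normalization step is optional when one merely wants the label rather than the probability.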

9.3 Interaction Effects

Table 9.1 contains separate Likelihoods for the words “royal”, “family”, and the combination “royal family”. The Likelihoods \(\text{P} (\text{royal} | real)\) and \(\text{P} (\text{family} | real)\), however, also count the articles that contain both words, which is why both are larger than \(\text{P} (\text{royal family} | real)\). The same holds for the fake news items: \(\text{P} (\text{royal} | fake) \geq \text{P} (\text{royal family} | fake)\) and \(\text{P} (\text{family} | fake) \geq \text{P} (\text{royal family} | fake)\)3.

The presence of the words “royal” and “family” in a news article has a quite different effect when both are present at the same time than when only one of them is observed. In other words, the effects of these words interact with each other. This can be illustrated by using the Likelihoods of each individual word rather than the Likelihood of the interaction effect:

\[ \begin{gather*} \text{P}(real | \text{shocking royal family}) \propto \text{P} (real) \times \text{P} (\text{shocking} | real) \times \text{P} (\text{royal} | real) \times \text{P} (\text{family} | real) = \\ 0.736 \times 0.043 \times 0.081 \times 0.281 \simeq 0.0007203401 \end{gather*} \]

and

\[ \begin{gather*} \text{P}(fake | \text{shocking royal family}) \propto \text{P} (fake) \times \text{P} (\text{shocking} | fake) \times \text{P} (\text{royal} | fake) \times \text{P} (\text{family} | fake) = \\ (1 - 0.736) \times 0.269 \times 0.092 \times 0.107 \simeq 0.0006990815 \end{gather*} \]

The corresponding probabilities are:

\(\text{P}(real | \text{shocking royal family}) = \frac{0.0007203401}{0.0007203401+0.0006990815} \simeq 0.507\)

and

\(\text{P}(fake | \text{shocking royal family}) = \frac{0.0006990815}{0.0007203401+0.0006990815} \simeq 0.493\)

The probabilities have changed considerably, and the conclusion differs from the previous computation. Of course, the first conclusion (i.e. that the article is fake) is the correct one: we must use the proper Likelihoods that take the interaction effects into account, as was done in the first computation. Neglecting interaction effects can lead to wrong conclusions, and this is exactly what the Naive Bayes method does under its conditional-independence assumption, unless the researcher explicitly includes the relevant interaction effects.
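The difference between the two computations can be placed side by side in a short Python sketch (numbers from Table 9.1; the variable names are illustrative):

```python
prior_real, prior_fake = 0.736, 0.264

# Scores using the joint Likelihood of the combination "royal family"
# (the interaction effect is respected):
joint_real = prior_real * 0.043 * 0.013
joint_fake = prior_fake * 0.269 * 0.087

# Scores using the individual word Likelihoods under the naive
# conditional-independence assumption:
naive_real = prior_real * 0.043 * 0.081 * 0.281
naive_fake = prior_fake * 0.269 * 0.092 * 0.107

print(joint_fake > joint_real)  # True  -> classified as fake (correct)
print(naive_fake > naive_real)  # False -> classified as real (wrong)
```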

9.4 Zero Probabilities

Consider the case where a newly observed news article contains the word combinations “Naive Bayes”, “shocking”, and “horror”. At first sight one would expect the article to be fake, because it contains words that are associated with sensationalism. There is, however, a zero Likelihood of observing the word combination “Naive Bayes” in a fake news article (i.e. \(\text{P} (\text{Naive Bayes} | fake) = 0\)), because no such article occurs in the collection of 1000 news items that was used to determine the Likelihoods in Table 9.1. No matter how many sensational words the article contains, a Naive Bayes Classifier based on Table 9.1 will therefore always predict the news item to be real instead of fake.

The true probability of observing a fake news article that contains the words “Naive Bayes” is not exactly zero but slightly larger, and the same can be said about any other word combination of interest. Therefore, it is common practice to add a smoothing parameter \(\alpha \in \mathbb{R}_{\geq 0}\) to the frequency counts of the real and fake categories. In practice, smoothing uses \(\alpha > 0\) (e.g. Laplace smoothing with \(\alpha = 1\)). A standard add-\(\alpha\) smoothing formula is often written as

\[ \text{P}(\text{word} | \text{class}) = \frac{\text{count}(\text{word}, \text{class}) + \alpha}{\text{total}_{\text{class}} + \alpha |V|} \]

where \(|V|\) denotes the vocabulary size. Table 9.2 illustrates the effect of smoothing in a simplified, row-wise fashion: each displayed entry is computed as \((\text{count} + \alpha)/1000\) within each class (i.e. with 1000 articles per class as the reference denominator and an effective vocabulary of one displayed combination per row, \(|V| = 1\)), rounded to three decimals. In this way, Table 9.1 can be recomputed for various values of \(\alpha\), as shown in Table 9.2.

Table 9.2: Likelihoods of Word Combinations in Real and Fake News Articles (with varying \(\alpha\))
Combination | \(\text{P} (comb | real)\), \(\alpha = 0\) | \(\text{P} (comb | fake)\), \(\alpha = 0\) | \(\text{P} (comb | real)\), \(\alpha = 1\) | \(\text{P} (comb | fake)\), \(\alpha = 1\) | \(\text{P} (comb | real)\), \(\alpha = 2\) | \(\text{P} (comb | fake)\), \(\alpha = 2\)
royal | 0.081 | 0.092 | 0.082 | 0.093 | 0.083 | 0.094
family | 0.281 | 0.107 | 0.282 | 0.108 | 0.283 | 0.109
royal family | 0.013 | 0.087 | 0.014 | 0.088 | 0.015 | 0.089
shocking | 0.043 | 0.269 | 0.044 | 0.270 | 0.045 | 0.271
horror | 0.013 | 0.311 | 0.014 | 0.312 | 0.015 | 0.313
secret wedding | 0.093 | 0.039 | 0.094 | 0.040 | 0.095 | 0.041
Naive Bayes | 0.001 | 0.000 | 0.002 | 0.001 | 0.003 | 0.002

If \(\alpha = 1\) the probability scores can be computed as follows:

\[ \begin{gather*} \text{P}(real | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (real) \times \text{P} (\text{Naive Bayes}_{\alpha = 1} | real) \times \text{P} (\text{shocking}_{\alpha = 1} | real) \times \text{P} (\text{horror}_{\alpha = 1} | real) = \\ 0.736 \times 0.002 \times 0.044 \times 0.014 \simeq 9.06752 \times 10^{-7} \end{gather*} \]

and

\[ \begin{gather*} \text{P}(fake | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (fake) \times \text{P} (\text{Naive Bayes}_{\alpha = 1} | fake) \times \text{P} (\text{shocking}_{\alpha = 1}| fake) \times \text{P} (\text{horror}_{\alpha = 1} | fake) = \\ (1 - 0.736) \times 0.001 \times 0.270 \times 0.312 \simeq 2.223936 \times 10^{-5} \end{gather*} \]

If \(\alpha = 2\) the following scores are obtained:

\[ \begin{gather*} \text{P}(real | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (real) \times \text{P} (\text{Naive Bayes}_{\alpha = 2} | real) \times \text{P} (\text{shocking}_{\alpha = 2} | real) \times \text{P} (\text{horror}_{\alpha = 2} | real) = \\ 0.736 \times 0.003 \times 0.045 \times 0.015 \simeq 1.4904 \times 10^{-6} \end{gather*} \]

and

\[ \begin{gather*} \text{P}(fake | \text{Naive Bayes shocking horror}) \propto \\ \text{P} (fake) \times \text{P} (\text{Naive Bayes}_{\alpha = 2} | fake) \times \text{P} (\text{shocking}_{\alpha = 2}| fake) \times \text{P} (\text{horror}_{\alpha = 2} | fake) = \\ (1 - 0.736) \times 0.002 \times 0.271 \times 0.313 \simeq 4.478654 \times 10^{-5} \end{gather*} \]

In both cases the conclusion is that the newly observed article is fake, so it does not matter whether \(\alpha\) is chosen to be 1 or 2. In general, however, the choice of \(\alpha\) is determined through a process called “hyperparameter optimisation”, in which a computer algorithm evaluates various candidate values automatically.
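The effect of smoothing on the zero Likelihood can be sketched as follows, using the simplified row-wise scheme of Table 9.2 (i.e. \((\text{count} + \alpha)/1000\) per class, not the full add-\(\alpha\) formula with \(|V|\); the function name is illustrative):

```python
def smoothed(count, alpha, total=1000):
    """Simplified add-alpha Likelihood as displayed in Table 9.2."""
    return (count + alpha) / total

# Raw counts per 1000 fake articles (Table 9.1): "Naive Bayes" never occurs.
counts_fake = {"Naive Bayes": 0, "shocking": 269, "horror": 311}

prior_fake = 1 - 0.736
score_fake = prior_fake
for count in counts_fake.values():
    score_fake *= smoothed(count, alpha=1)

print(f"{score_fake:.6e}")  # no longer zero, so "fake" can win the comparison
```

With \(\alpha = 0\) the loop would multiply by zero for “Naive Bayes” and the fake score would collapse to zero, which is exactly the problem described above.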

9.5 Types of Naive Bayes Classifiers

In the presentation of the Naive Bayes Classifier above, it was implicitly assumed that the likelihood statistics are determined by counting the number of cases (i.e. news articles) that contain a certain word (or word combination), divided by the total number of cases that were screened for each category (i.e. real or fake). This is the Bernoulli Naive Bayes setup. In other words, the underlying dataset that is used to obtain the likelihoods would look similar to Table 9.3 where a value of one indicates that the word is present and a zero that the word is not present.

Table 9.3: Dataset 1 for Naive Bayes Classifier Example
label | royal family | shocking | horror | Naive Bayes | …
real | 0 | 0 | 0 | 0 | …
real | 0 | 0 | 0 | 1 | …
fake | 1 | 1 | 0 | 0 | …
real | 0 | 0 | 0 | 0 | …
fake | 1 | 1 | 1 | 0 | …
fake | 1 | 0 | 1 | 0 | …
fake | 0 | 1 | 0 | 0 | …

It is, however, also possible to use word counts within each article instead of a binary presence indicator (this corresponds to the Multinomial Naive Bayes classifier). The underlying dataset would then look similar to Table 9.4.

Table 9.4: Dataset 2 for Naive Bayes Classifier Example
label | royal family | shocking | horror | Naive Bayes | …
real | 0 | 0 | 0 | 0 | …
real | 0 | 0 | 0 | 4 | …
fake | 18 | 6 | 0 | 0 | …
real | 0 | 0 | 0 | 0 | …
fake | 3 | 5 | 11 | 0 | …
fake | 9 | 0 | 7 | 0 | …
fake | 0 | 8 | 0 | 0 | …

Other types of Naive Bayes Classifiers can be designed by using word combinations in which the order of the words is taken into account when computing the likelihoods (rather than treating documents as a mere “bag of words” without order or grammar). There are numerous other ways in which the likelihood table can be constructed and, in practice, the proper selection of predictors (i.e. columns in the dataset) often turns out to be the key to a successful Naive Bayes Classifier. These predictors are sometimes called “features”, and the process of preparing them (i.e. selecting and computing the measurements to be used in the prediction model) is referred to as “feature engineering” in the Artificial Intelligence literature.
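As an illustration of how the Bernoulli likelihood table is estimated from a dataset such as Table 9.3, here is a minimal from-scratch sketch (toy data copied from Table 9.3; in practice a library such as scikit-learn's BernoulliNB would typically be used):

```python
from collections import defaultdict

# Presence/absence dataset in the spirit of Table 9.3.
features = ["royal family", "shocking", "horror", "Naive Bayes"]
data = [
    ("real", [0, 0, 0, 0]),
    ("real", [0, 0, 0, 1]),
    ("fake", [1, 1, 0, 0]),
    ("real", [0, 0, 0, 0]),
    ("fake", [1, 1, 1, 0]),
    ("fake", [1, 0, 1, 0]),
    ("fake", [0, 1, 0, 0]),
]

alpha = 1  # Laplace smoothing
counts = defaultdict(lambda: [0] * len(features))
totals = defaultdict(int)
for label, row in data:
    totals[label] += 1
    for j, present in enumerate(row):
        counts[label][j] += present

# Smoothed Bernoulli Likelihood P(feature present | class); the denominator
# adds 2 * alpha because each feature is either present or absent.
likelihood = {
    label: [(c + alpha) / (totals[label] + 2 * alpha) for c in counts[label]]
    for label in totals
}

print(likelihood["fake"][0])  # P("royal family" present | fake) = (3+1)/(4+2)
```

Classifying a new article then amounts to multiplying the class prior by the appropriate entries of this table, exactly as in the worked example of Section 9.2.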


  1. In accordance with Bayes’ Theorem of Equation 7.4, these are also called Likelihoods.↩︎

  2. If we had hand-picked, say, 500 real and 500 fake news items, then it would be better to rely on the judgement of experts to elicit a “subjective” prior probability, which would likely be much closer to the real unconditional probability that a news item is real.↩︎

  3. This is a direct consequence of Theorem C because if \(\text{P}(B | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cap \neg C | A)\) then \(\text{P}(B | A) \geq \text{ P}(B \cap C | A)\).↩︎


© 2026 Patrick Wessa. Provided as-is, without warranty.
