17 Multinomial Distribution

17.1 Definition

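Let \(\mathbf{X}=(X_1,\dots,X_K)\) be a random vector of category counts and \(\mathbf{r}=(r_1,\dots,r_K)\) a vector of non-negative integers, with categories indexed by \(k \in \{1,\dots,K\}\). The probability mass function of the Multinomial Distribution is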

\[ \text{P}(\mathbf{X}=\mathbf{r})= \begin{cases} \frac{n!}{r_1!\cdots r_K!}\, p_1^{r_1}\cdots p_K^{r_K}, & \text{if } \sum_{k=1}^K r_k=n,\\ 0, & \text{otherwise.} \end{cases} \]

where \(K\) = number of categories, \(p_k\) = probability of category \(k\), \(p_k \ge 0\), \(\sum_{k=1}^K p_k = 1\), \(n\) = number of independent draws, and \(X_k\) = number of outcomes in category \(k\).
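As a minimal sketch of the definition, the probability mass function can be evaluated directly from the formula and compared with R's built-in `dmultinom()`. The helper name `multinomial_pmf` and the example values are illustrative assumptions, not part of the handbook.

```r
# Hypothetical helper: evaluates the multinomial PMF straight from the definition.
multinomial_pmf <- function(r, p, n = sum(r)) {
  stopifnot(length(r) == length(p), all(p >= 0), abs(sum(p) - 1) < 1e-12)
  if (sum(r) != n) return(0)               # counts must sum to n, otherwise P = 0
  pos <- r > 0                             # skip empty categories (avoids 0 * log(0))
  # Work on the log scale so the factorials do not overflow for large n
  log_coef <- lgamma(n + 1) - sum(lgamma(r + 1))
  exp(log_coef + sum(r[pos] * log(p[pos])))
}

r_example <- c(2, 1, 1)                    # illustrative counts (assumption)
p_example <- c(0.5, 0.3, 0.2)              # illustrative probabilities (assumption)

print(multinomial_pmf(r_example, p_example))
print(dmultinom(r_example, prob = p_example))
```

For these illustrative values both calls give \(12 \times 0.5^2 \times 0.3 \times 0.2 = 0.18\).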

The Multinomial Distribution therefore generalises the Bernoulli and Binomial Distributions to more than two categories:

when \(K = 2\) and \(n = 1\), it is equivalent to the Bernoulli Distribution
when \(K = 2\) and \(n > 1\), it is equivalent to the Binomial Distribution, with \(X_1\) counting the successes and \(X_2 = n - X_1\) the failures
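The two special cases can be checked numerically. In this sketch the values `p = 0.3` and `n = 10` are arbitrary illustrative choices; with \(K = 2\) the count vector is \((x, n - x)\).

```r
p <- 0.3    # illustrative success probability (assumption)
n <- 10     # illustrative number of draws (assumption)
x <- 0:n

# K = 2: multinomial probabilities of (x, n - x) versus the Binomial PMF
multi_k2 <- sapply(x, function(k) dmultinom(c(k, n - k), prob = c(p, 1 - p)))
binom    <- dbinom(x, size = n, prob = p)
print(all.equal(multi_k2, binom))

# K = 2 and n = 1: the two possible outcomes (0, 1) and (1, 0)
# have probabilities 1 - p and p, i.e. a Bernoulli trial
bern <- c(dmultinom(c(0, 1), prob = c(p, 1 - p)),
          dmultinom(c(1, 0), prob = c(p, 1 - p)))
print(bern)
```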

17.2 Mean

\[ \text{E}(X_k) = n p_k \]

17.3 Variance

\[ \text{V}(X_k) = n p_k (1-p_k) \]
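One way to see where the mean and variance come from (a standard argument, sketched here with the auxiliary indicator notation \(I_{k,t}\), which is not used elsewhere in the handbook): let \(I_{k,t}=1\) if draw \(t\) falls in category \(k\) and \(I_{k,t}=0\) otherwise, so that \(X_k=\sum_{t=1}^n I_{k,t}\). Each indicator is a Bernoulli variable with success probability \(p_k\), and the \(n\) draws are independent, hence

\[ \text{E}(X_k)=\sum_{t=1}^n \text{E}(I_{k,t}) = n p_k, \qquad \text{V}(X_k)=\sum_{t=1}^n \text{V}(I_{k,t}) = n p_k (1-p_k). \]

In other words, each single count \(X_k\) is marginally Binomial with parameters \(n\) and \(p_k\).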

17.4 Covariance

\[ \text{Cov}(X_i, X_j) = -n p_i p_j \quad \text{for } i \ne j \]

The covariance is negative because counts are constrained to sum to \(n\): if one category gets more counts, at least one other category must get fewer.
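The moment formulas can be checked by simulation. The sketch below uses `rmultinom()` with arbitrary illustrative parameters and an arbitrary seed (all assumptions) and compares the sample means and the sample covariance matrix with their theoretical counterparts.

```r
set.seed(123)                        # arbitrary seed for reproducibility
n <- 20                              # illustrative number of draws (assumption)
p <- c(0.55, 0.30, 0.15)             # illustrative probabilities (assumption)

# rmultinom() returns a K x N matrix; transpose so each row is one count vector
draws <- t(rmultinom(100000, size = n, prob = p))

# Theoretical covariance matrix
theo_cov <- -n * outer(p, p)         # off-diagonal entries: -n * p_i * p_j
diag(theo_cov) <- n * p * (1 - p)    # diagonal entries: n * p_k * (1 - p_k)

print(round(colMeans(draws), 2))     # compare with n * p
print(n * p)
print(round(cov(draws), 2))          # compare with theo_cov
print(round(theo_cov, 2))
```

Every row of the theoretical covariance matrix sums to zero, which is another way of expressing the constraint that the counts always add up to \(n\).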

17.5 Moment Generating Function

\[ M_{\mathbf{X}}(t_1,\dots,t_K)=\left(\sum_{k=1}^K p_k e^{t_k}\right)^n \]
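As a short worked check (a standard calculation, not taken from the handbook), the moments in Sections 17.2 and 17.4 follow from differentiating the moment generating function and evaluating at \(t_1=\dots=t_K=0\), where \(\sum_{k=1}^K p_k e^{t_k}=1\):

\[ \text{E}(X_k)=\left.\frac{\partial M_{\mathbf{X}}}{\partial t_k}\right|_{\mathbf{t}=\mathbf{0}} = \left. n\, p_k e^{t_k}\left(\sum_{l=1}^K p_l e^{t_l}\right)^{n-1}\right|_{\mathbf{t}=\mathbf{0}} = n p_k \]

\[ \text{E}(X_i X_j)=\left.\frac{\partial^2 M_{\mathbf{X}}}{\partial t_i\,\partial t_j}\right|_{\mathbf{t}=\mathbf{0}} = n(n-1)\,p_i p_j \quad \text{for } i \ne j \]

\[ \text{Cov}(X_i, X_j)=\text{E}(X_i X_j)-\text{E}(X_i)\,\text{E}(X_j) = n(n-1)\,p_i p_j - n^2 p_i p_j = -n p_i p_j \]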

17.6 Purpose

Within this handbook, the Multinomial Distribution has multiple practical uses:

Multi-class event modeling: whenever one trial can fall into one of several categories (e.g. support ticket outcomes, customer response classes, defect types).
Bridge from Binomial to multi-category data: the Binomial model (Chapter 13) is the special case \(K=2\); multinomial extends this to \(K>2\).
Foundational model for count-based classification: the Multinomial Naive Bayes Classifier directly uses this distribution for token/count features (Chapter 9).
Expected-vs-observed category diagnostics: expected counts \(n p_k\) from the multinomial model connect naturally to the Pearson chi-squared framework (Section 124.1, Chapter 124).
Contingency-table interpretation: multinomial logic underlies how row/column category counts are interpreted in contingency tables (Chapter 57) and in classification summaries such as confusion matrices (Chapter 59).

17.7 R Module

The Multinomial Probabilities app is available in the handbook menu:

Distributions / Multinomial Probabilities

It is also accessible directly at:

https://shiny.wessa.net/multinomial/

17.8 Business Example: Support Ticket Routing Mix

A SaaS support team routes incoming premium tickets into three mutually exclusive categories:

resolved on first contact
resolved after follow-up
escalated to engineering

Based on historical operations, the expected proportions are:

\[ (p_1,p_2,p_3)=(0.55,0.30,0.15) \]

On a given shift, \(n=20\) tickets were handled and the observed counts were:

\[ (x_1,x_2,x_3)=(8,8,4) \]

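The R code below computes the exact multinomial probability of this specific split, together with the expected counts and a descriptive Pearson chi-squared statistic: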

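# Observed counts and historical category probabilities
counts <- c(8, 8, 4)
probs <- c(0.55, 0.30, 0.15)
n <- sum(counts)

# Exact probability of observing exactly this split
cat("Exact multinomial probability:\n")
print(dmultinom(counts, prob = probs))

# Expected counts n * p_k under the historical mix
expected <- n * probs
names(expected) <- c("First contact", "Follow-up", "Escalated")
cat("\nExpected counts under historical mix:\n")
print(expected)

# Descriptive Pearson statistic: sum of (observed - expected)^2 / expected
chisq_stat <- sum((counts - expected)^2 / expected)
cat("\nPearson chi-squared statistic (descriptive):\n")
print(chisq_stat)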

Exact multinomial probability:
[1] 0.01734239

Expected counts under historical mix:
First contact     Follow-up     Escalated 
           11             6             3 

Pearson chi-squared statistic (descriptive):
[1] 1.818182

You can reproduce this scenario with the Multinomial Probabilities app described in Section 17.7.

Interpretation:

The observed follow-up and escalation counts are above their expected values, while first-contact resolution is below expectation.
This may indicate a temporary complexity spike (harder tickets), staffing mismatch, or process bottlenecks.
The app’s chi-squared statistic is useful as a quick discrepancy indicator; for formal inferential testing and p-values, continue with the Pearson chi-squared test framework in Section 124.1. For goodness-of-fit with \(K\) categories and no estimated parameters, the reference degrees of freedom are \(K-1\).
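As a hedged sketch of the formal step mentioned in the last point, R's `chisq.test()` can run the goodness-of-fit test against the historical mix with \(K-1=2\) degrees of freedom. The object name `gof` is an illustrative choice, and this is not necessarily how Section 124.1 or the app performs the test.

```r
counts <- c(8, 8, 4)
probs  <- c(0.55, 0.30, 0.15)

# Goodness-of-fit test of the observed counts against the historical mix.
# suppressWarnings() hides R's warning about the small expected count (3 < 5).
gof <- suppressWarnings(chisq.test(counts, p = probs))

print(gof$statistic)   # same value as the descriptive statistic computed earlier
print(gof$parameter)   # degrees of freedom: K - 1 = 2
print(gof$p.value)

# The same p-value as an upper-tail probability of the chi-squared distribution
print(pchisq(unname(gof$statistic), df = length(counts) - 1, lower.tail = FALSE))
```

Because one expected count is below 5, the chi-squared approximation is rough for this sample; `chisq.test(counts, p = probs, simulate.p.value = TRUE)` is one Monte Carlo alternative.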

17.9 Additional Academic Example: Hardy-Weinberg Genotype Counts

For a biallelic locus with allele frequencies \(p_A=0.7\) and \(p_a=0.3\), Hardy-Weinberg proportions are:

\[ (p_{AA},p_{Aa},p_{aa})=(p_A^2,\ 2p_Ap_a,\ p_a^2)=(0.49,0.42,0.09). \]

Suppose \(n=120\) individuals are observed with genotype counts:

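\[ (x_{AA},x_{Aa},x_{aa})=(64,45,11). \]

The same three quantities as in the business example (exact probability, expected counts, and descriptive Pearson chi-squared statistic) are computed as follows: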

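# Observed genotype counts (AA, Aa, aa) and Hardy-Weinberg proportions
counts_hw <- c(64, 45, 11)
probs_hw <- c(0.49, 0.42, 0.09)
n_hw <- sum(counts_hw)

# Exact probability of observing exactly these genotype counts
cat("Exact multinomial probability:\n")
print(dmultinom(counts_hw, prob = probs_hw))

# Expected genotype counts n * p_k under Hardy-Weinberg proportions
expected_hw <- n_hw * probs_hw
names(expected_hw) <- c("AA", "Aa", "aa")
cat("\nExpected counts under Hardy-Weinberg proportions:\n")
print(expected_hw)

# Descriptive Pearson statistic: sum of (observed - expected)^2 / expected
chisq_hw <- sum((counts_hw - expected_hw)^2 / expected_hw)
cat("\nPearson chi-squared statistic (descriptive):\n")
print(chisq_hw)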

Exact multinomial probability:
[1] 0.005733821

Expected counts under Hardy-Weinberg proportions:
  AA   Aa   aa 
58.8 50.4 10.8 

Pearson chi-squared statistic (descriptive):
[1] 1.042139
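To move from the descriptive statistic to a formal test, one possible sketch uses `chisq.test()` with the Hardy-Weinberg proportions treated as given, so that the reference degrees of freedom are \(K-1=2\), as stated in Section 17.8. The object name `gof_hw` is an illustrative choice.

```r
counts_hw <- c(AA = 64, Aa = 45, aa = 11)
probs_hw  <- c(0.49, 0.42, 0.09)   # allele frequencies assumed known, not estimated

# Goodness-of-fit test against the Hardy-Weinberg proportions
gof_hw <- chisq.test(counts_hw, p = probs_hw)

print(gof_hw$statistic)   # matches the descriptive statistic above
print(gof_hw$parameter)   # degrees of freedom: K - 1 = 2
print(gof_hw$p.value)
print(gof_hw$expected)    # expected counts 58.8, 50.4, 10.8
```

If the allele frequencies were instead estimated from the same sample, one parameter would be estimated and the usual reference distribution would have one degree of freedom less; that case is not covered by this sketch.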