8 Sensitivity and Specificity
Bayes’ Theorem is closely related to the definitions of Sensitivity and Specificity, which can be illustrated with a binary classification problem. Suppose that we are interested in real-time fraud detection for online credit card payments: if \(H_1\) is true, the financial transaction is fraudulent, and if \(H_2\) is true, it is legitimate. In other words, we wish to design a model that helps us decide whether or not a financial transaction is allowed to proceed during the process of completing an e-commerce sale.
The default prediction we make (i.e. when there is insufficient evidence to suggest otherwise) is that a transaction is legitimate (\(H_2\) is true). We do this because historical analysis shows that most transactions are legitimate: only about 0.2% of transactions (the so-called “prevalence”) involve stolen credit cards or some form of identity fraud. Another reason why we do not decide that \(H_1\) is true by default is the legal presumption of innocence. Since we wish to design a real-time fraud detection system (i.e. one that prevents a credit card transaction from proceeding before any money is transferred), false negatives are often considered less disruptive to the immediate customer experience than false positives. However, undetected fraud still has real costs and, if detected later, may only be recoverable after investigation.
| | \(H_2\) is true | \(H_1\) is true |
|---|---|---|
| Accept \(H_2\) | True Negative (TN) | False Negative (FN) (type II error) |
| Reject \(H_2\) | False Positive (FP) (type I error) | True Positive (TP) |
| | True Negative Rate = TNR = TN / (TN + FP) = Specificity | True Positive Rate = TPR = TP / (FN + TP) = Sensitivity (Recall) |

Table 8.1: Confusion matrix for a binary classifier, with the column-wise rates (Specificity and Sensitivity).
Table 8.1 shows that the Sensitivity, or True Positive Rate (TPR), reflects the proportion of fraud cases that are correctly identified: fraud detection models with high Sensitivity are good at detecting fraud. The table also shows the Specificity, or True Negative Rate (TNR), which is the proportion of legitimate transactions that are correctly identified. In other words, fraud detection models with high Specificity are good at recognising legitimate transactions.
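As a concrete illustration, the two rates in Table 8.1 can be computed directly from confusion-matrix counts. A minimal Python sketch (the counts below are invented for illustration and are not taken from the text):

```python
def sensitivity(tp: int, fn: int) -> float:
    """True Positive Rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True Negative Rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical batch of 1000 transactions, 20 of them fraudulent.
tp, fn = 18, 2     # 18 of the 20 fraud cases were flagged
tn, fp = 950, 30   # 950 of the 980 legitimate transactions were passed

print(f"Sensitivity = {sensitivity(tp, fn):.3f}")  # prints 0.900
print(f"Specificity = {specificity(tn, fp):.3f}")  # prints 0.969
```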
A perfect prediction model would have 100% Specificity and Sensitivity. In practice, however, classifiers have non-zero error. In fact, there is a theoretical lower bound on classification error called the “Bayes error rate,” determined by the overlap of the class distributions; it is generally unknown in practice (see Chapter 113).
Furthermore, there is often a trade-off between Sensitivity and Specificity: a change that improves the Sensitivity of the model will often lower its Specificity, and vice versa. Improving both at the same time typically requires more data or data of better quality.
Suppose that our fraud detection system has 99% Sensitivity and Specificity and that the prevalence of fraud (based on historical data) is 0.2%. What is the probability that a random transaction which is classified as a “positive” actually involves fraud?
Suppose that the computed fraud probability is used to support a practical decision (e.g. block the transaction immediately or send it for manual review). How large should the probability of fraud be before we decide to act?
The simple formulation of Bayes’ Theorem (Equation 7.3) states that
\[ \begin{equation} \text{P}(H_1 | D) = \frac{\text{P}(D | H_1) \text{P}(H_1)}{\text{P}(D)} \end{equation} \]
which becomes¹
\[ \begin{equation} \text{P}(H_1 | D+) = \frac{\text{P}(D+ | H_1) \text{P}(H_1)}{\text{P}(D+ | H_1) \text{P}(H_1) + \text{P}(D+ | H_2) \text{P}(H_2) } \end{equation} \]
or
\[ \begin{equation} \text{P}(H_1 | D+) = \frac{0.99 \times 0.002}{0.99 \times 0.002 + (1 - 0.99) (1 - 0.002) } \simeq 16.6\% \end{equation} \]
The same result can be obtained through the odds formula (Equation 7.4):
\[ \begin{equation} \frac{\text{P}(H_1 | D+)}{\text{P}(H_2 | D+)} = \frac{\text{P}(D+ | H_1)}{\text{P}(D+ | H_2)} \frac{\text{P}(H_1)}{\text{P}(H_2)} = \frac{0.99}{(1 - 0.99)} \frac{0.002}{(1 - 0.002)} = \frac{0.00198}{0.00998} \end{equation} \]
which leads to a probability of \(0.00198 / (0.00198 + 0.00998) \simeq 16.6\%\).
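The calculation can be checked numerically. The sketch below computes the posterior both from the direct form of Bayes’ Theorem and from the odds form, using the values in the text (Sensitivity = Specificity = 0.99, prevalence = 0.2%):

```python
sens, spec, prev = 0.99, 0.99, 0.002

# Direct form of Bayes' Theorem:
# P(H1 | D+) = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
posterior = sens * prev / (sens * prev + (1 - spec) * (1 - prev))

# Odds form: posterior odds = likelihood ratio * prior odds,
# then convert odds back to a probability via odds / (1 + odds).
odds = (sens / (1 - spec)) * (prev / (1 - prev))
posterior_from_odds = odds / (1 + odds)

print(f"{posterior:.4f}")            # prints 0.1656
print(f"{posterior_from_odds:.4f}")  # prints 0.1656 (same value)
```

Both routes give roughly 16.6%, confirming the result above: even with 99% Sensitivity and Specificity, the low prevalence means most positive classifications are false alarms.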
Whether 16.6% is high enough to justify action depends on a decision threshold and the relative costs of false positives (blocking legitimate transactions) and false negatives (allowing fraud to proceed).
The fraud example above introduces a core idea that will return throughout the handbook: there is no universally correct decision threshold. The threshold must be chosen according to the purpose of the decision.
Once we know the posterior probability of fraud (here about 16.6%), we still need a rule for action: should the system block the transaction, allow it, or send it for manual review? A stricter threshold may reduce false positives (fewer legitimate transactions blocked) but increase false negatives (more fraud missed). A more permissive threshold may do the opposite. The appropriate choice depends on the relative costs of these errors and the operational context.
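One common way to formalise such a rule is to compare expected costs: blocking minimises expected cost when \(p \, C_{FN} > (1 - p) \, C_{FP}\), i.e. when \(p > C_{FP} / (C_{FP} + C_{FN})\). The sketch below illustrates this; the cost values and the three-way block/review/allow routing are assumptions chosen for illustration, not prescribed by the text:

```python
def decide(p_fraud: float, cost_fp: float = 1.0, cost_fn: float = 20.0,
           review_band: float = 0.5) -> str:
    """Route a transaction based on its posterior fraud probability.

    cost_fp: cost of blocking a legitimate transaction (false positive).
    cost_fn: cost of letting a fraudulent transaction through (false negative).
    review_band: fraction of the block threshold above which we route
                 borderline cases to manual review (illustrative choice).
    """
    threshold = cost_fp / (cost_fp + cost_fn)  # block above this probability
    if p_fraud > threshold:
        return "block"
    elif p_fraud > review_band * threshold:
        return "review"
    return "allow"

# With these costs the block threshold is 1/21 ~= 0.048, so the 16.6%
# posterior from the fraud example is well above it.
print(decide(0.166))  # prints "block"
```

Note how the threshold falls out of the cost ratio rather than being fixed at 50%: if missed fraud is twenty times as costly as a blocked legitimate sale, even a fairly low posterior probability justifies acting.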
This is the same logic used later in hypothesis testing:
- in classification, we choose a classification threshold;
- in hypothesis testing, we choose a significance threshold \(\alpha\) (or equivalently a confidence level).
In both cases, changing the threshold changes the balance between error types. This is why the handbook later distinguishes between thresholds for confirmatory tests and thresholds used for diagnostic screening or assumption checks.
For the classification-threshold version of this idea, see Chapter 60.
For the general framework (including \(\alpha\) and confidence levels), see Chapter 112.
¹ The symbol \(D\) is replaced by \(D+\) because we predict that the transaction is fraudulent.