6 Jeffreys’ axiom system
In this section, we introduce the axiom system of Jeffreys (1961) for probability theory, as summarized by Zellner (1983). We then deduce several important theorems and formalize, in a general way, some results of Chapter 5.
Jeffreys uses the convention that, when probabilities are expressed by numbers, larger numbers correspond to more probable statements. Furthermore, he adopts the Exclusiveness principle as a convention: the probability of a disjunction of mutually exclusive propositions is the sum of their individual probabilities.
6.1 Axiom 1: Comparability
For any propositions \(A\), \(B\), and \(C\), exactly one of the following alternatives is true (they are exclusive and exhaustive):
\[ \begin{gather*} \text{P}(B | A) < \text{P}(C | A) \text{, or} \\ \text{P}(B | A) = \text{P}(C | A) \text{, or} \\ \text{P}(B | A) > \text{P}(C | A) \end{gather*} \]
6.2 Axiom 2: Transitivity
\[ \begin{gather*} \text{P}(B | A) > \text{P}(C | A) \text{ and } \text{P}(C | A) > \text{P}(D | A) \Rightarrow \text{P}(B | A) > \text{P}(D | A) \end{gather*} \]
6.3 Axiom 3: Deducibility
For all propositions \(A\) and all propositions \(B_i\) that \(A\) implies, it follows that
\[ \text{P}(B_i | A) = \text{P}(B_j | A) \text{, for all } i,j \]
For all propositions \(A\) and all propositions \(B_i\) that are inconsistent with \(A\) (i.e. \(A\) and \(B_i\) cannot both be true), it follows that
\[ \text{P}(B_i | A) = \text{P}(B_j | A) \text{, for all } i,j \]
6.4 Axiom 4
If \(\text{P}(B_1 B_2 | A) = 0\) and \(\text{P}(C_1 C_2 | A) = 0\) and \(\text{P}(B_1 | A) = \text{P}(C_1 | A)\) and \(\text{P}(B_2 | A) = \text{P}(C_2 | A)\) then \(\text{P}((B_1 \cup B_2) | A) = \text{P}((C_1 \cup C_2) | A)\).
6.5 Axiom 5
All relationships \(\text{P}(A) > \text{P}(B)\) can be expressed by numbers, i.e. by a set of real numbers increasing with probability.
This axiom ensures that there are “enough” numbers to express all probability orderings. It is also implicitly assumed that a probability of 1 corresponds to certainty; therefore \(\text{P}(B | A) = 1\) if \(A\) implies \(B\).
6.6 Axiom 6
If \(A \cap B\) implies \(C\) then \(\text{P}(B \cap C | A) = \text{P}(B | A)\).
6.7 Axiom 7
\[ \text{P}(B \cap C | A) = \text{P}(B | A) \text{P}(C | B \cap A) / \text{P}(B | B \cap A) \]
which simplifies to
\[ \text{P}(B \cap C | A) = \text{P}(B | A) \text{P}(C | B \cap A) \]
since \(\text{P}(B | B \cap A) = 1\).
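To see the product rule in action, here is a minimal numerical check in Python. It is not part of Jeffreys' system: the uniform finite sample space and the `prob` helper are assumptions made purely for illustration, with propositions represented as sets of equally likely outcomes.

```python
from fractions import Fraction

def prob(event, given):
    """P(event | given) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & given), len(given))

# One roll of a fair die: A = all outcomes, B = "even", C = "greater than 3".
A = {1, 2, 3, 4, 5, 6}
B = {2, 4, 6}
C = {4, 5, 6}

# Axiom 7: P(B and C | A) = P(B | A) P(C | B and A) / P(B | B and A),
# where the divisor P(B | B and A) equals 1.
assert prob(B, B & A) == 1
assert prob(B & C, A) == prob(B, A) * prob(C, B & A)  # 1/3 == (1/2) * (2/3)
print(prob(B & C, A))  # 1/3
```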
6.8 Theorem A
If \(A\) implies “not \(B\)” then \(\text{P}(B | A) = 0\).
6.8.1 Proof
Using Axiom 3, this is easy to prove. We know that \(A\) implies \(\neg B\), that \(A\) implies \(A\), and that \(\text{P}(A | A) = 1\); therefore \(\text{P}(\neg B | A) = \text{P}(A | A) = 1\). Since \(B\) and \(\neg B\) are exclusive and \(A\) implies \(\neg B \cup B\), additivity gives \(\text{P}(\neg B \cup B | A) = \text{P}(\neg B | A) + \text{P}(B | A) = 1\), and hence \(\text{P}(B | A) = 1 - 1 = 0\).
From this theorem, together with the convention that more probable propositions receive larger numbers than less probable ones, it follows that all probabilities must be greater than or equal to 0: no proposition can be less probable on data \(A\) than one that is impossible on \(A\), and such propositions have probability 0.
6.9 Theorem B
If \(B\) is true if and only if \(C\) is true (we say that \(B\) and \(C\) are equivalent), then it follows that
\(B\) implies \(B \cap C\),
\(C\) implies \(B \cap C\), and
\(\text{P}(B | A) = \text{P}(C | A)\).
6.9.1 Proof
By Axiom 7,
\[ \text{P}(B \cap C | A) = \text{P}(B | A)\text{P}(C | B \cap A). \]
Because \(B\) and \(C\) are equivalent, \(B\) implies \(C\), hence \(\text{P}(C | B \cap A)=1\). Therefore
\[ \text{P}(B \cap C | A) = \text{P}(B | A). \tag{1} \]
Similarly, by Axiom 7,
\[ \text{P}(B \cap C | A) = \text{P}(C | A)\text{P}(B | C \cap A). \]
Since \(C\) implies \(B\), we have \(\text{P}(B | C \cap A)=1\), and thus
\[ \text{P}(B \cap C | A) = \text{P}(C | A). \tag{2} \]
From (1) and (2), it follows that
\[ \text{P}(B | A) = \text{P}(C | A). \]
This proves Theorem B.
6.10 Theorem C
\[ \text{P}(B | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cap \neg C | A). \]
Therefore \(\text{P}(B | A) \geq \text{P}(B \cap C | A)\), and, applying the theorem to \(B \cup C\) in place of \(B\), \(\text{P}(B \cup C | A) \geq \text{P}(C | A)\).
6.10.1 Proof
From Axiom 7 we have
\[ \text{P}(B \cap C | A) = \text{P}(B | A)\text{P}(C | B \cap A) \]
and
\[ \text{P}(B \cap \neg C | A) = \text{P}(B | A)\text{P}(\neg C | B \cap A). \]
Adding both expressions gives
\[ \text{P}(B \cap C | A) + \text{P}(B \cap \neg C | A) = \text{P}(B | A)\left[\text{P}(C | B \cap A) + \text{P}(\neg C | B \cap A)\right]. \]
By the complementation rule for conditional probabilities, \(\text{P}(C | B \cap A) + \text{P}(\neg C | B \cap A) = 1\). Hence
\[ \text{P}(B \cap C | A) + \text{P}(B \cap \neg C | A) = \text{P}(B | A). \]
This proves Theorem C.
Interpretation: Theorem C simply partitions the event \(B\) into two disjoint cases, one where \(C\) is true and one where \(\neg C\) is true. Example: if \(B\) is “draw a face card” and \(C\) is “draw a king”, then “face card” is split into “king” and “not king”.
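The face-card example can be checked numerically under the same kind of illustrative counting model (the deck, the propositions, and the `prob` helper are again assumptions for illustration, not part of the source argument):

```python
from fractions import Fraction

def prob(event, given):
    """P(event | given) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & given), len(given))

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
A = {(r, s) for r in ranks for s in suits}            # draw one card from a full deck
B = {(r, s) for (r, s) in A if r in {"J", "Q", "K"}}  # B: face card
C = {(r, s) for (r, s) in A if r == "K"}              # C: king

# Theorem C: P(B | A) = P(B and C | A) + P(B and not-C | A)
assert prob(B, A) == prob(B & C, A) + prob(B & (A - C), A)
print(prob(B, A), prob(B & C, A), prob(B & (A - C), A))  # 3/13 = 1/13 + 2/13
```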
6.11 Theorem D (Addition Rule)
\[ \text{P}(B | A)\text{ + P}(C | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cup C | A). \]
From this theorem it follows that \(\text{P}(B \cup C | A) \le \text{ P}(B | A) \text{ + P}(C | A)\).
Theorems C and D can be used to show that
\(\max(\text{P}(B | A), \text{P}(C | A)) \le \text{P}(B \cup C | A) \le \text{P}(B | A) + \text{P}(C | A)\).
These are a lower and an upper bound for \(\text{P}(B \cup C | A)\), respectively.
6.11.1 Proof
By Theorem C,
\(\text{P}(B | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cap \neg C | A)\)
and
\(\text{P}(C | A)\text{ = P}(B \cap C | A)\text{ + P}(\neg B \cap C | A)\).
Adding these equations gives
\(\text{P}(B | A)\text{ + P}(C | A)\text{ = 2P}(B \cap C | A)\text{ + P}(B \cap \neg C | A)\text{ + P}(\neg B \cap C | A)\).
Since \((B \cup C)\) is the disjoint union of \((B \cap \neg C)\), \((B \cap C)\), and \((\neg B \cap C)\),
\(\text{P}(B \cup C | A)\text{ = P}(B \cap \neg C | A)\text{ + P}(B \cap C | A)\text{ + P}(\neg B \cap C | A)\).
Substituting this into the previous equation yields
\(\text{P}(B | A)\text{ + P}(C | A)\text{ = P}(B \cap C | A)\text{ + P}(B \cup C | A)\) (Q.E.D.)
Interpretation: Theorem D is the conditional version of the standard addition rule (inclusion-exclusion).
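A sketch verifying the addition rule and the bounds above on a die roll, under the same assumed uniform-counting model (illustrative only):

```python
from fractions import Fraction

def prob(event, given):
    """P(event | given) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & given), len(given))

A = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
B = {2, 4, 6}           # B: the roll is even
C = {4, 5, 6}           # C: the roll is greater than 3

# Theorem D: P(B | A) + P(C | A) = P(B and C | A) + P(B or C | A)
assert prob(B, A) + prob(C, A) == prob(B & C, A) + prob(B | C, A)

# Bounds: max(P(B | A), P(C | A)) <= P(B or C | A) <= P(B | A) + P(C | A)
assert max(prob(B, A), prob(C, A)) <= prob(B | C, A) <= prob(B, A) + prob(C, A)
print(prob(B | C, A))  # 2/3
```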
6.12 Theorem E
\[ \begin{gather*} \text{If P}(B_1 | A)\text{ = P}(B_2 | A)\text{ = P}(B_3 | A)\text{ = }…\text{ = P}(B_n | A)\\ \text{ and }B_1, B_2, B_3, …, B_n\text{ are mutually } \\ \text{exclusive on data $A$, and $Q$ = }B_{i(1)} \cup B_{i(2)} \cup B_{i(3)} \cup … \cup B_{i(k)}\text{, and} \\ R = B_{j(1)} \cup B_{j(2)} \cup B_{j(3)} \cup … \cup B_{j(h)}\text{, then P}(Q | A)\text{ / P}(R | A) = k / h. \end{gather*} \]
6.12.1 Proof
\(\text{P}(Q | A)\text{ = P}(B_{i(1)} | A)\text{ + P}(B_{i(2)} | A)\text{ + P}(B_{i(3)} | A)\text{ + }\) … \(\text{ + P}(B_{i(k)} | A)\)
(by convention under exclusiveness)
\(= k\,\text{P}(B_1 | A)\)
\(\text{P}(R | A)\text{ = P}(B_{j(1)} | A)\text{ + P}(B_{j(2)} | A)\text{ + P}(B_{j(3)} | A)\text{ + }\) … \(\text{ + P}(B_{j(h)} | A)\)
(by convention under exclusiveness)
\(= h\,\text{P}(B_1 | A)\)
Therefore \(\text{P}(Q | A)\text{ / P}(R | A) = k / h\) (Q.E.D.).
6.13 Theorem F
\[ \begin{gather*} \text{If P}(B_1 | A)\text{ = P}(B_2 | A)\text{ = P}(B_3 | A)\text{ = }…\text{ = P}(B_n | A)\\ \text{ and }B_1, B_2, B_3, …, B_n\text{ are mutually} \\ \text{exclusive and exhaustive on data $A$ and }\\ Q = B_{i(1)} \cup B_{i(2)} \cup B_{i(3)} \cup … \cup B_{i(k)}\text{, and} \\ R = B_1 \cup B_2 \cup B_3 \cup … \cup B_n\text{, then $A$ implies $R$, and P}(R | A)\text{ = 1, and} \\ \text{P}(Q | A) = k / n. \end{gather*} \]
According to Jeffreys, this can be interpreted as follows:
… given that a set of alternatives are equally probable, exclusive and exhaustive, the probability that some one of any subset is true is the ratio of the number in that subset to the whole number of possible cases
(Zellner (1983)).
Interpretation: if alternatives are equally likely and exhaustive, probability becomes “count favorable cases / count total cases.” Example: for one fair die roll, \(Q=\{2,4,6\}\) has 3 favorable outcomes out of 6 equally likely outcomes, so \(\text{P}(Q)=3/6=1/2\).
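The die example in code, again under the illustrative counting model (the sets and the `prob` helper are assumptions for illustration); the last lines also check the ratio form of Theorem E:

```python
from fractions import Fraction

def prob(event, given):
    """P(event | given) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & given), len(given))

A = {1, 2, 3, 4, 5, 6}  # n = 6 exclusive, exhaustive, equally likely alternatives
Q = {2, 4, 6}           # k = 3 of the alternatives
R = A                   # the exhaustive union of all n alternatives

# Theorem F: P(R | A) = 1 and P(Q | A) = k / n
assert prob(R, A) == 1
assert prob(Q, A) == Fraction(len(Q), len(A))  # 3/6 = 1/2

# Theorem E (ratio form), with a second subset S of h = 2 alternatives:
S = {1, 2}
assert prob(Q, A) / prob(S, A) == Fraction(len(Q), len(S))  # k / h = 3/2
print(prob(Q, A))  # 1/2
```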
6.13.1 Proof
By definition of exclusivity and exhaustivity:
\(\text{P}(R | A)\text{ = P}(B_1 \cup B_2 \cup B_3 \cup … \cup B_n | A) \text{ = P}(B_1 | A)\text{ + P}(B_2 | A)\text{ + P}(B_3 | A)\text{ + }…\text{ + P}(B_n | A) = 1\)
From Theorem E it follows that \(\text{P}(Q | A) / \text{P}(R | A) = k / n\), and therefore \(\text{P}(Q | A) = k / n\) (Q.E.D.).
6.14 Theorem G (density and distribution)
\[ \begin{gather*} \text{The cumulative distribution function (cdf) is F}(z_0)\text{,} \\ \text{with F}(z_0)\text{ = P}(z \le z_0 | A) \\ \text{(i.e. the probability that a continuous variable $z$ is smaller than or equal to }z_0\text{, given }A\text{).} \\ \\ \text{If F}(z_0)\text{ is differentiable, it is possible to define the probability density function (pdf)} \\ \text{as f}(z_0)\text{ = F}'(z_0)\text{. For }\Delta z > 0\text{, } \text{P}(z_0 < z < z_0 + \Delta z | A)\text{ = } \int_{z_0}^{z_0+\Delta z} f(x|A)\,\text{d}x. \\ \text{(In infinitesimal notation, this is often written heuristically as f}(z_0)\,\text{d}z\text{ = P}(z_0 < z < z_0 + \text{d}z | A)\text{.)} \end{gather*} \]
For continuous distributions, \(\text{P}(z < z_0 | A) = \text{P}(z \le z_0 | A)\) because a single point has probability zero.
6.14.1 Proof
For a continuous variable with density \(f(x|A)\), the cdf satisfies \(F(z_0)=\int_{-\infty}^{z_0} f(x|A)\text{d}x\). The derivative of the cdf can be written as
\[ \begin{gather*} \underset{\Delta z\rightarrow 0}{\text{lim}}\frac{\text{F}(z_0 + \Delta z) - \text{F}(z_0)}{\Delta z}\text{ = }\underset{\Delta z \rightarrow 0}{\text{lim}}\frac{ \int_{-\infty}^{z_0+\Delta z} f(x|A)\,\text{d}x - \int_{-\infty}^{z_0} f(x|A)\,\text{d}x}{\Delta z} \\ = \underset{\Delta z \rightarrow 0}{\text{lim}}\frac{ \int_{z_0}^{z_0+\Delta z} f(x|A)\,\text{d}x}{\Delta z} \end{gather*} \]
and therefore \(f(z_0)=F'(z_0)\). For any interval width \(\Delta z > 0\), \(\text{P}(z_0 < z < z_0 + \Delta z | A) = \int_{z_0}^{z_0+\Delta z} f(x|A)\,\text{d}x\). In infinitesimal notation this is often written heuristically as \(f(z_0)\,\text{d}z = \text{P}(z_0 < z < z_0 + \text{d}z \mid A)\) (Q.E.D.).
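As a numerical illustration of \(f(z_0) = F'(z_0)\), the sketch below compares the finite-difference quotient of a cdf with the corresponding pdf. The choice of the standard normal distribution is an assumption made purely for illustration; it is not part of the source derivation.

```python
import math

def F(z):
    """Standard normal cdf: F(z) = P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f(z):
    """Standard normal pdf: f(z) = F'(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

z0 = 0.7
for dz in (1e-1, 1e-3, 1e-5):
    # The difference quotient (F(z0 + dz) - F(z0)) / dz approaches f(z0) as dz -> 0.
    print(dz, (F(z0 + dz) - F(z0)) / dz, f(z0))
```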
6.15 Theorem H (suggesting axiom 7)
\[ \begin{gather*} \text{If P}(B_1 | A)\text{ = P}(B_2 | A)\text{ = P}(B_3 | A)\text{ = }…\text{ = P}(B_n | A)\text{, and }B_1, B_2, B_3, …, B_n\text{ are mutually exclusive} \\ \text{on data $A$ and also on data $R$ } \cap \text{ $A$, and $Q$ = }B_1 \cup B_2 \cup B_3 \cup … \cup B_n\text{, and $R$ is a} \\ \text{subset of $Q$ }(\text{with $r$ propositions})\text{, and $S$ is a subset of $Q$ }(\text{with $s$ propositions})\text{, then} \\ \text{P}(R \cap S | A)\text{ = P}(R | A)\text{ P}(S | R \cap A) / \text{ P}(R | R \cap A). \end{gather*} \]
\[ \begin{gather*} \text{This can be simply written as P}(R \cap S | A)\text{ = P}(R | A)\text{ P}(S | R \cap A)\\\text{ since P}(R | R \cap A) = 1. \end{gather*} \]
The interpretation of this theorem is given in Zellner:
In other words, given \(A\) throughout, the probability that the true proposition is in the intersection of \(R\) and \(S\) is equal to the probability that it is in \(R\) times the probability that it is in \(S\), given that it is in \(R\). … he (Jeffreys) regards this theorem as suggestive of the simplest rule that relates probabilities based on different data, here denoted by \(A\) and \(R \cap A\), and puts forward the following axiom (see axiom 7).
(Zellner (1983)).
In other words: Theorem H suggests Axiom 7.
6.15.1 Proof
\[ \begin{gather*} \text{Let }m = |R \cap S|\text{ be the number of propositions common to }R\text{ and }S. \\ \text{Then, by Theorem F, } \text{P}(R \cap S | A) = m/n\text{ and } \text{P}(R | A) = r/n. \\ \text{Within }R\text{ (conditioning on }R \cap A\text{), the probability of }S\text{ is } \text{P}(S | R \cap A)=m/r. \\ \text{Therefore } \text{P}(R | A)\text{P}(S | R \cap A)= (r/n)(m/r)=m/n=\text{P}(R \cap S | A)\text{ (Q.E.D.).} \end{gather*} \]
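A counting check of Theorem H, with the same illustrative `prob` helper and an assumed set of ten equally likely alternatives (both are sketches for illustration, not part of the source proof):

```python
from fractions import Fraction

def prob(event, given):
    """P(event | given) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & given), len(given))

Q = set(range(1, 11))   # n = 10 exclusive, equally likely alternatives (the data A)
R = {1, 2, 3, 4, 5, 6}  # r = 6 of them
S = {4, 5, 6, 7, 8}     # s = 5 of them; |R & S| = m = 3

# Theorem H: P(R and S | A) = P(R | A) P(S | R and A), since P(R | R and A) = 1
assert prob(R, R & Q) == 1
assert prob(R & S, Q) == prob(R, Q) * prob(S, R & Q)  # m/n = (r/n)(m/r) = 3/10
print(prob(R & S, Q))  # 3/10
```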
6.16 Theorem I (Bayes Theorem)
\[ \begin{gather*} \text{If }B_1, B_2, B_3, …, B_n\text{ are alternatives, and $A$ is the available information, and }X \\ \text{is the additional information, then} \end{gather*} \]
\[ \forall i \in \{1,2,…,n\}: \frac{\text{P}(B_i | X \cap A)\, \text{P}(B_i | B_i \cap A) }{\text{P}(B_i | A)\, \text{P}(X | B_i \cap A) } = \mu \]
where \(\mu\) is a constant not depending on \(i\). (Note that \(\text{P}(B_i | B_i \cap A) = 1\), so this factor can be dropped.)
Alternatively
\[ \forall i \in \{1,2,…,n\}: \text{P}(B_i | X \cap A) = \mu \text{P}(B_i | A) \text{P}(X | B_i \cap A) \]
where
\(\mu = \left( \sum_{i=1}^{n} \text{P}(B_i | A) \text{P}(X | B_i \cap A) \right)^{-1}\), which follows from summing over the alternatives: since the \(B_i\) are exclusive and exhaustive, \(\sum_{i=1}^{n} \text{P}(B_i | X \cap A) = 1\)
(also known as Bayes’ theorem).
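Finally, a worked sketch of Theorem I showing \(\mu\) acting as the normalizing constant; the priors and likelihoods below are invented purely for illustration:

```python
from fractions import Fraction

# Illustrative priors P(B_i | A) and likelihoods P(X | B_i and A) for three
# exclusive, exhaustive alternatives; all numbers are assumptions for the example.
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihood = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]

# mu is the reciprocal of the sum of the prior-times-likelihood terms.
mu = 1 / sum(p * l for p, l in zip(prior, likelihood))

# Theorem I: P(B_i | X and A) = mu P(B_i | A) P(X | B_i and A)
posterior = [mu * p * l for p, l in zip(prior, likelihood)]
assert sum(posterior) == 1  # mu normalizes the posterior probabilities
for i, post in enumerate(posterior, start=1):
    print(f"P(B_{i} | X and A) = {post}")
```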