Back to Bayesics - Part 1/3



What's Bayesian? Why Bayesian?

So, what is a Bayesian approach, who can claim to be Bayesian, and what constitutes Bayesian thinking? Does using Bayes's Rule make us Bayesians? No. Many approaches make use of Bayes's Rule (an aside: it is Laplace to whom we owe a great deal for the development of this formula and its understanding; Reverend Bayes arrived at it through rather flawed reasoning). What sets a Bayesian apart from a frequentist are the underlying philosophies that inform methodology. For frequentists, when we sample and resample measurables, uncertainty arises from varying measurements, so only the sampling process (real, or imagined as a thought experiment) has a probability distribution: only the data is probabilistic. For Bayesians, that is but one source of uncertainty among many aspects of the inference process that have probability distributions. It could very well be that the parameters themselves are uncertain, and that uncertainty is propagated through our model and thus into our inference.

Suppose we are measuring weights. A non-measurable like the standard deviation of the weight will have its own distribution, just as the mean will (so even the standard deviation has a mean and a standard deviation of its own, the standard deviation of the standard deviation). There are many possible combinations of average weight and spread consistent with what we measured, so we take all of them into account and see which ones more plausibly yield our data. In fact, in forming models, we can think of the parameters as stand-ins for unknowns, which may in the future be informed by an entirely new model with other parameters of its own. It's turtles all the way down (in this case, parameters all the way down).

In addition, Bayesians do not depend on long-run asymptotic behavior to make useful inferences (although such behavior is certainly recognized), so whether a dataset is small or large, it is always usable and informative. There is no such thing as a free lunch, though: with Bayesian inference we must choose a prior (more on choosing priors via maximum entropy, and why this matters, in later posts). In exchange, we will not need ad hoc rules like the rule of 30 samples, or the 5% rejection rule for statistical significance.
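To make the "many combinations of average weight and spread" idea concrete, here is a minimal sketch in Python. The weight data, grid ranges, and flat priors are all invented for illustration: we score every candidate (mean, standard deviation) pair by how plausibly it would have produced the data, and find that even the standard deviation ends up with a mean and a standard deviation of its own.

```python
import numpy as np
from scipy.stats import norm

weights = np.array([61.2, 58.4, 63.1, 59.8, 60.5])  # hypothetical data (kg)

mu_grid = np.linspace(50, 70, 200)       # candidate average weights
sigma_grid = np.linspace(0.5, 10, 200)   # candidate spreads
mu, sigma = np.meshgrid(mu_grid, sigma_grid)

# Log-likelihood of the data for every (mu, sigma) combination,
# with flat priors (a deliberate simplification for this sketch).
log_lik = norm.logpdf(weights[:, None, None], loc=mu, scale=sigma).sum(axis=0)
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

# Even the standard deviation now has a mean and a standard deviation:
sigma_marginal = posterior.sum(axis=1)   # marginalize over mu
sigma_mean = (sigma_grid * sigma_marginal).sum()
sigma_sd = np.sqrt(((sigma_grid - sigma_mean) ** 2 * sigma_marginal).sum())
print(f"E[sigma] = {sigma_mean:.2f}, SD[sigma] = {sigma_sd:.2f}")
```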

From one perspective, Bayes's Rule is a shorthand for counting up the possible ways for a particular event to occur, out of all possible occurrences. From another, it is the only mathematically viable functional form that satisfies certain criteria of 'common sense' and consistency for making inferences about events. In this series of three blog posts, we will explore the latter view (Bayes's Rule through the lens of logic): first by asking whether there exists a function that can reason in a 'sufficient' manner (and probing which operations within such a function are sufficient); then by deriving the first rule of reasoning, the product rule, which lets us ask about the plausibility of two propositions AB (A and B); and finally by deriving the sum rule, which completes the foundation for plausible inference.
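As a toy illustration of the counting view (the bag-of-marbles setup here is invented for illustration, in the spirit of McElreath's examples): suppose a bag holds four marbles, each blue or white, and we draw with replacement, observing blue, white, blue. Counting the ways each possible bag composition could have produced that sequence already gives us relative plausibilities:

```python
observed = ["blue", "white", "blue"]

for n_blue in range(5):  # bags with 0..4 blue marbles
    n_white = 4 - n_blue
    ways = 1
    for draw in observed:
        ways *= n_blue if draw == "blue" else n_white
    print(f"{n_blue} blue: {ways} ways")

# The resulting counts (0, 3, 8, 9, 0) are proportional to the posterior
# that Bayes's Rule assigns to each composition under a uniform prior.
```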

Symbols and Notation

Before we embark on our derivation, let's first define a few symbols and notation so we can talk about things in shorthand, which will make concise work of our discussions. Most fundamental to our discussions will be propositions, which we will represent with uppercase letters: $$ A, B, C, D, E, F, \dots $$

These propositions are of the simple Aristotelian kind, taking only the values True or False. For example, we may define: $$ A \equiv \text {Unicorns, of the horned variety, exist} $$

We need not be able to ascertain the truth value of such a statement (that is the job of the system we are constructing), but it will be necessary for now to limit ourselves to unambiguous statements. Readers may question whether there is a better logic system from which we could derive our rules of inference, but we are reminded of Jaynes's argument that such a system would necessarily need to inform us of things not already known or derivable while still remaining consistent with existing findings; we have yet to discover such a consistent alternative. Having satisfied ourselves that we can work within this system, we will adopt the conventions of Boolean algebra in order to reason about the truth of more than one proposition at a time. To reason about the truth value of both A and B, we introduce the logical product, or conjunction:

$$ AB $$ Above, we represent the statement "both propositions A and B are True." Clearly, when we reason about the truth of both A and B, we can consider B first or A first, so BA yields the same result as AB (this symmetry will be very useful later in our derivation). Next, we introduce the logical sum, or disjunction, to talk about the truth value of one proposition OR another (here we mean the logical operator OR, which is true for either A or B or both, not the common linguistic use of "or" that means exclusively A or B but not both).

$$ A + B $$ Above, we represent the statement "at least one of the propositions A, B is true." Finally, we can compare the truth values of two propositions:

$$ A = B $$ The above says that B has the same truth value as A. In contrast, we can define what a proposition represents: $$ A \equiv \overline B $$ Note that we have slipped in a bar over B, which represents the negation of proposition B; thus the statement above says that, by definition, A is the negation of B.

Finally, we will list the common and useful identities that are easily provable and can be found in many textbooks on logic (we copy the following from Jaynes). Note the use of parentheses for dictating grouping and order of operations.

$$ \begin{align*} \text {Idempotence: } &\begin{cases} AA=A\\ A+A=A\\ \end{cases} \\\\ \text {Commutativity: } &\begin{cases} AB = BA\\ A+B=B+A\\ \end{cases} \\\\ \text {Associativity: } &\begin{cases} A(BC) = (AB)C = ABC\\ A + (B + C) = (A + B) + C = A + B + C\\ \end{cases}\\\\ \text {Distributivity: } &\begin{cases} A(B + C) = AB + AC\\ A + (BC) = (A + B)(A + C)\\ \end{cases}\\\\ \text {Duality: } &\begin{cases} \text {If } C = AB, \text { then } \overline C = \overline A + \overline B\\ \text {If } D = A + B, \text { then } \overline D = \overline A \overline B \end{cases}\\\\ \end{align*} $$
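These identities are easy to verify by brute force. Here is a minimal Python check, enumerating every truth assignment:

```python
from itertools import product

for A, B, C in product([True, False], repeat=3):
    assert (A and A) == A and (A or A) == A                      # idempotence
    assert (A and B) == (B and A) and (A or B) == (B or A)       # commutativity
    assert (A and (B and C)) == ((A and B) and C)                # associativity
    assert (A or (B or C)) == ((A or B) or C)
    assert (A and (B or C)) == ((A and B) or (A and C))          # distributivity
    assert (A or (B and C)) == ((A or B) and (A or C))
    assert (not (A and B)) == ((not A) or (not B))               # duality
    assert (not (A or B)) == ((not A) and (not B))
print("All identities hold for every truth assignment.")
```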

Sufficient Function

Before we proceed to the derivation, we must check that a function meant to help us infer plausibly can be sufficient for any number of propositions in its argument. To do that, we first consider a single proposition and the possible outputs a function of it might produce. Given a single proposition, A, there are two possible values for it to take on, T or F. Let's consider the possible outputs a function might give:


$$ \begin{array}{c|cc} & A = T & A = F \\ \hline F_1(A) & T & T \\ F_2(A) & T & F \\ F_3(A) & F & T \\ F_4(A) & F & F \\ \end{array} $$


So, given that A can be T or F, the first function $F_1(A)$ outputs T and T, respectively. Next, the function $F_2(A)$, given inputs T and F, outputs T and F, respectively, and so on. A function can output T or F independently for each value of A, so we have $2^x$ possible logic functions, where $x$ is the number of distinct input combinations the argument(s) can take (in this case 2). Thus for $n = 1$ proposition as input (just A), we have only $2^2 = 4$ possible logic functions. From inspection, it is clear that the outputs of these functions have the following equivalent truth values: $$F_1(A) = A + \overline A $$ $$F_2(A) = A $$ $$F_3(A) = \overline A $$ $$F_4(A) = A\overline A$$
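A quick sketch that enumerates these four functions and confirms their truth tables match the table above:

```python
candidates = {
    "F1 = A + not-A": lambda A: A or (not A),
    "F2 = A":         lambda A: A,
    "F3 = not-A":     lambda A: not A,
    "F4 = A * not-A": lambda A: A and (not A),
}
for name, f in candidates.items():
    print(name, "->", [f(A) for A in (True, False)])
# Output columns (A=T, A=F) match the table rows: TT, TF, FT, FF.
```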

Let's see another example for two propositions (A, B) so we may discover a useful pattern. Propositions A and B can each take on T or F, so there are 4 possible input combinations for $(A, B)$, namely TT, TF, FT, and FF. Since a function may output T or F for each of those inputs, there are $2^4 = 16$ possible logic functions:



$$ \begin{array}{c|cccc} (A, B): & T, T & T, F & F, T & F, F \\ \hline F_1(A,B) & T & F & F & F \\ F_2(A,B) & F & T & F & F \\ F_3(A,B) & F & F & T & F \\ F_4(A,B) & F & F & F & T \\ \hline F_5(A,B) & T & T & F & F \\ F_6(A,B) & T & F & T & F \\ F_7(A,B) & T & F & F & T \\ F_8(A,B) & F & T & T & F \\ F_9(A,B) & F & T & F & T \\ F_{10}(A,B) & F & F & T & T \\ \hline F_{11}(A,B) & T & T & T & F \\ F_{12}(A,B) & T & T & F & T \\ F_{13}(A,B) & T & F & T & T \\ F_{14}(A,B) & F & T & T & T \\ \hline F_{15}(A,B) & T & T & T & T \\ F_{16}(A,B) & F & F & F & F \\ \end{array} $$


The horizontal lines segregating the functions into groups $F_1$ - $F_4$, $F_5$ - $F_{10}$, etc., serve two purposes. The first is to offer another way of seeing that the first group amounts to asking how many ways there are to output exactly one T value across the four possible input combinations, $\binom {4}{1}$ (or equivalently, three F out of four, $\binom {4}{3}$). The answer is 4: four distinct functions. Thus, the total number of distinct functions yielding all of the possible ways to have zero T, one T, two T, three T, and four T outputs across the four inputs is $\binom {4}{0} + \binom {4}{1} + \binom {4}{2} + \binom {4}{3} + \binom {4}{4} = 16 = 2^4$. This is just another way of arriving at our earlier count of the maximum number of possible logic functions.
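We can confirm this count by brute force; the following sketch enumerates all 16 truth tables and groups them by their number of T outputs:

```python
from itertools import product
from collections import Counter

# One output per input pair (T,T), (T,F), (F,T), (F,F):
tables = list(product([True, False], repeat=4))
counts = Counter(sum(table) for table in tables)
for k in range(5):
    print(f"{counts[k]} functions with {k} T outputs")  # C(4, k): 1, 4, 6, 4, 1
print("total:", len(tables))  # 16 = 2^4
```

For the second purpose, let's inspect the first four functions to see their equivalents: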

$$F_1(A,B) = AB$$ $$F_2(A,B) = A \overline B$$ $$F_3(A,B) = \overline A B$$ $$F_4(A,B) = \overline A \overline B$$


Clearly, these are just the conjunctions (products) of our basic propositions and/or their negations. Let's check out $F_5(A,B)$. Since it has T in the first and second columns, we can see that it is a combination of $F_1(A,B)$ and $F_2(A,B)$, because those have T in their first and second columns, respectively. Noting that $B + \overline B$ is always true, the reduction follows:

$$F_5(A,B) = F_1(A,B) + F_2(A,B) \\ = AB + A \overline B \\ = A(B + \overline B) \\ = A$$

We can perform the same kind of inspection with $F_6(A,B)$: since it has T in its first and third columns, $F_1(A,B)$ and $F_3(A,B)$ must be its components. The rest of the functions have the same kind of relationship to the first four basic conjunctions.

$$F_6(A,B) = F_1(A,B) + F_3(A,B) \\ = AB + \overline A B \\ = (A + \overline A)B \\ = B$$
$$F_{11}(A,B) = F_1(A,B) + F_2(A,B) + F_3(A,B) \\ = AB + A \overline B + \overline A B \\ = A(B + \overline B) + \overline AB \\ = A + \overline A B$$

Looking at the first four functions again, we see that because each has its single T in a different one of the four columns, together covering all four input combinations, it is possible to form any other function in this function-space as a combination of them. In other words, they are basis vectors that span the space of $n = 2$ logic functions. As a result, every other function is built from negation, conjunction, disjunction, or some combination of these three operations. In logic studies, this is called reduction to disjunctive normal form. It is easy to see that for any larger set of $n$ propositions, we can go through the same process and arrive at the same conclusion: we require only our basic propositions and these three operations. Thus the basic propositions, together with these operations, form a sufficient set.
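Here is a small sketch of this reduction in code: given any truth table, we rebuild the corresponding function as a disjunction of the basic conjunctions at the rows where it outputs T, confirming the earlier result that $F_5(A,B) = A$.

```python
inputs = [(True, True), (True, False), (False, True), (False, False)]

def dnf(table):
    """Build a function from a truth table as an OR of AND-terms (minterms)."""
    def f(A, B):
        return any(
            (A if a else not A) and (B if b else not B)
            for (a, b), out in zip(inputs, table) if out
        )
    return f

F5 = dnf([True, True, False, False])          # the table T T F F for F5
assert all(F5(A, B) == A for A, B in inputs)  # F5(A, B) = A, as derived above
print("F5 reduces to A via its disjunctive normal form.")
```

With a little more thought, we recall the property of duality jotted down previously, which is relevant to us here in a modified form: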

$$A + B = \overline {\overline A \overline B}$$

Hence, it is possible to reduce the three operations down to two, as the disjunction of two propositions can be expressed using only negation and conjunction! Let's take a step back and survey the work we have done. We have shown that a function given $n$ propositions need only handle negation and conjunction of its arguments in order to express every possible logic function of them. Such a function will be deemed sufficient for plausible inference once we have discovered the two rules it requires to operate on propositions, the sum and product rule. Note that it is possible to reduce things further to a single operation called NAND (or, alternatively, NOR), but for our purposes, we are done for the moment! Rejoice.
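A final brute-force check of the modified duality identity, along with the further reduction to NAND mentioned above:

```python
from itertools import product

nand = lambda x, y: not (x and y)

for A, B in product([True, False], repeat=2):
    # Disjunction via negation and conjunction: A + B = not(not-A and not-B)
    assert (A or B) == (not ((not A) and (not B)))
    # Everything via NAND alone:
    assert (not A) == nand(A, A)                        # negation
    assert (A and B) == nand(nand(A, B), nand(A, B))    # conjunction
    assert (A or B) == nand(nand(A, A), nand(B, B))     # disjunction
print("Disjunction reduces to negation + conjunction, and all three to NAND.")
```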

Credits

In this blog post, I follow the thinking laid out in E.T. Jaynes's Probability Theory: The Logic of Science, with supplemental material from Myron Tribus's Rational Descriptions, Decisions, and Designs. I have also followed references to various papers, such as Cox's paper solving the functional equations presented here. Where Jaynes deemed logical bridges unnecessary, I have filled them in with additional resources, as well as different perspectives, to elucidate the sometimes unclear prose of his book. In addition, I have found Richard McElreath's Statistical Rethinking: A Bayesian Course with Examples in R and Stan to be an invaluable resource for practical usage. None of these thoughts are original, but they have been presented and reorganized in a way that may be more relevant to those interested in Machine Learning and AI.