#### Learning and Cognitive Systems

# Definition: Bayesian Network

Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic programming. In *Proceedings of the on Future of Software Engineering* (FOSE 2014). ACM, New York, NY, USA, 167-181. DOI=10.1145/2593882.2593900 doi.acm.org/10.1145/2593882.2593900

"A Bayesian Network is a directed acyclic graph *G = <V, E>*, where every vertex *v in V *is associated with a random variable *Xv*, and every edge *(u, v) in E* represents a direct dependence from the random variable *Xu* to the random variable *Xv*. Let *Deps(v) = {u | (u, v) in E}* denote the *direct dependences* of node *v in V*. In a Bayesian Network, each node *v in V* of the graph is associated with a *conditional probability distribution CPD(v), *which denotes the probability distribution of *Xv* conditioned over the values of the random variables associated with the direct dependences *D(v)*." (Gordon et al., 2014).

# Example 5: The Bayesian Network Graph for the "Student Model"

"... now consider a slightly more complex scenario. The student’s grade, in this case, depends not only on his intelligence but also on the difficulty of the course, represented by a random variable *D* whose domain is *Val(D) = {easy, hard}*. Our student asks his professor for a recommendation letter. The professor is absentminded and never remembers the names of her students. She can only look at his grade, and she writes her letter for him based on that information alone. The quality of her letter is a random variable *L*, whose domain is *Val(L) = {strong, weak}*. The actual quality of the letter depends stochastically on the grade. (It can vary depending on how stressed the professor is and the quality of the coffee she had that morning.)

We therefore have five random variables in this domain: the student’s intelligence (*I*), the course difficulty (*D*), the grade (*G*), the student’s SAT score (*S*), and the quality of the recommendation letter (*L*). All of the variables except *G* are binary-valued, and *G* is ternary-valued. Hence, the joint distribution has 48 entries.

As we saw in our simple illustrations ..., a Bayesian network is represented using a directed graph whose nodes represent the random variables and whose edges represent direct influence of one variable on another. We can view the graph as encoding a generative sampling process executed by nature, where the value for each variable is selected by nature using a

distribution that depends only on its parents. In other words, each variable is a stochastic function of its parents.

Based on this intuition, perhaps the most natural network structure for the distribution in this example is the one presented in the figure (left). The edges encode our intuition about the way the world works. The course difficulty and the student’s intelligence are determined independently, and before any of the variables in the model. The student’s grade depends on both of these factors. The student’s SAT score depends only on his intelligence. The quality of the professor’s recommendation letter depends (by assumption) only on the student’s grade in the class. Intuitively, each variable in the model depends directly only on its parents in the network. We formalize this intuition later." (Koller & Friedman, Probabilistic Graphical Models, 2009, p.52f)

# Example 5: Bayesian network with local CPDs of "Student Model"

"The second component of the Bayesian network representation is a set of *local probability models* that represent the nature of the dependence of each variable on its parents. One such model, *P(I)*, represents the distribution in the population of intelligent versus less intelligent student. Another, *P(D)*, represents the distribution of difficult and easy classes. The distribution over the student’s grade is a conditional distribution *P(G | I, D)*. It specifies the distribution over the student’s grade, inasmuch as it depends on the student’s intelligence and the course difficulty. Specifically, we would have a different distribution for each assignment of values *i, d*.

For example, we might believe that a smart student in an easy class is 90 percent likely to get an A, 8 percent likely to get a B, and 2 percent likely to get a C. Conversely, a smart student in a hard class may only be 50 percent likely to get an A. In general, each variable *X* in the model is associated with a *conditional probability distribution (CPD)* that specifies a distribution over the values of *X* given each possible joint assignment of values to its parents in the model.

For a node with no parents, the CPD is conditioned on the empty set of variables. Hence, the CPD turns into a marginal distribution, such as *P(D)* or *P(I)*. One possible choice of CPDs for this domain is shown in figure (left). The network structure together with its CPDs is a *Bayesian network* *B*; we use *B-student* to refer to the Bayesian network for our student example. How do we use this data structure to specify the joint distribution? Consider some particular state in this space, for example, *i1, d0, g2, s1, l0*. Intuitively, the probability of this event can be computed from the probabilities of the basic events that comprise it: the probability that the student is intelligent; the probability that the course is easy; the probability that a smart student gets a B in an easy class; the probability that a smart student gets a high score on his SAT; and the probability that a student who got a B in the class gets a weak letter. The total probability of this state is:

*P(i1, d0, g2, s1, l0)* = *P(i1)P(d0)P(g2 | i1, d0)P(s1 | i1)P(l0 | g2) *= 0.3*0.6*0.08*0.8*0.4 = 0.004608.

Clearly, we can use the same process for any state in the joint probability space. In general, we will have that

*P(I, D, G, S, L) *= *P(I)P(D)P(G | I, D)P(S | I)P(L | G)*.

This equation is our first example of the *chain rule for Bayesian networks* which we will define in a general setting in section 3.2.3.2." (Koller & Friedman, Probabilistic Graphical Models, 2009, p.53f)

Here is a summary of the domains:

Val(D) = <d0, d1> = <easy, hard>

Val(I) = <i0, i1> = <non smart, smart>

Val(G) = <g0, g1, g2> = <A, B, C> = <excellent, good, average>

Val(S) = <s0, s1> = <low score, high score>

Val(L) = <l0, l1> = <weak recomm. letter, strong recomm. letter>

# Example 5: PROB-Code for Bayesian network "Student Model"

Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic programming. In *Proceedings of the on Future of Software Engineering* (FOSE 2014). ACM, New York, NY, USA, 167-181. DOI=10.1145/2593882.2593900 doi.acm.org/10.1145/2593882.2593900

Here is a summary of the domains as defined by Koller & Friedman (2009, p52f):

Val(D) = <d0, d1> = <easy, hard>

Val(I) = <i0, i1> = <non smart, smart>

Val(G) = <g1, g2, g3> = <A, B, C> = <excellent, good, average>

Val(S) = <s0, s1> = <low score, high score>

Val(L) = <l0, l1> = <weak recomm. letter, strong recomm. letter>

These domains and the CPDs of the original Koller & Friedman model were modified by Gordon et al. in subtle ways as we show below.

"The figure shows an example Bayesian Network with 5 nodes and corresponding random variables (1) Difficulty, (2) Intelligence, (3) Grade, (4) SAT and (5) Letter. The example is adapted from [Koller & Friedman, 2009] and describes a model which relates the intelligence of a student, the difficulty of a course he takes, the grade obtained by the student, the SAT score of the student, and the strength of the recommendation letter obtained by the student from the professor. We denote these random variables with their first letters *I, D, G, S* and *L* respectively in the discussion below.

Each of the random variables in this example are discrete random variables and take values from a finite domain. Consequently, the CPD at each node *v* can be represented as tables, where rows are indexed by values of the random variables associated with *Deps(v)* and the columns are the probabilities associated with each value of *Xv*. For example, the CPD associated with the node for the random variable *G* has 2 columns associated with the 2 possible values of *G *namely *g0, g1*, and 4 rows corresponding to various combinations of values possible for the direct dependencies of the node namely namely *i0, d0, i0, d1, i1, d0*, and *i1, d1*.

The graph structure together with the CPD at each node specifies the joint probability distribution over *I, D, G, S and L* compactly. The probability of a particular state in the joint distribution can be evaluated by starting at the leaf nodes of the Bayesian Network (that is the nodes at the “top” without any inputs) and proceeding in topological order. For instance

P(i1, d0, g1, s1, l0) =

P(i1)*P(d0)*P(g1 | i1, d0)*P(s1 | i1)*P(l0 | g1) =

0.3*0.6*0.1*0.8*0.4 = 0.00576" (Gordon et al., 2014)

The CPDs of Koller's and Friedman's model were modified by Gordon et al. in many aspects:

1. Val(D) was changed from <g1, g2, g3> to <g0, g1>

2. in P(G | D, I) the numbers in columns g2 and g3 of Koller's model were added.

3. row 2 in P(G | D, I) is wrongly indexed 'i1, d1'. This should be 'i0, d1'

4. in P(L | G), the row g3 in Koller's CPD was dropped

5. in P(L | G), the probabilities in Koller's row g1 were swapped. The probabilities in row g0 are left untouched. The question is whether this is a slip or a fundamental change. We think that by this local swap the **meaning of l0 and l1 should be reversed !** The meaning of l0 is now "strong_letter" and of l1 is now "weak letter" !

Here is a summary of the domains due to the modifications of Gordon et al. (2014):

Val(D) = <d0, d1> = <easy, hard>

Val(I) = <i0, i1> = <non smart, smart>

Val(G) = <**g0, g1**> = <A, **B+C**> = <excellent, **good+average**>

Val(S) = <s0, s1> = <low score, high score>

Val(L) = <l0, l1> = <**strong_letter, weak_letter**>

# Example 5: PROB-Code for Bayesian Network "Student Model"

Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic programming. In *Proceedings of the on Future of Software Engineering* (FOSE 2014). ACM, New York, NY, USA, 167-181. DOI=10.1145/2593882.2593900 doi.acm.org/10.1145/2593882.2593900

"The figure (left) shows how to encode the Bayesian "Student" Network as a probabilistic program in PROB. Note that the program just traverses the Bayesian Network in topological order and performs evaluation according to the CPD at each node." (Gordon et al., 2014)

Here is a summary of the domains due to the modifications of Gordon et al. (2014):

Val(D) = <d0, d1> = <easy, hard>

Val(I) = <i0, i1> = <non smart, smart>

Val(G) = <**g0, g1**> = <A, **B+C**>

= <excellent, **good+average**>

Val(S) = <s0, s1> = <low score, high score>

Val(L) = <l0, l1> = <**strong_letter, weak_letter**>

# Example 5: CHURCH-Code for Bayesian Network "Student Model"

The PROB-code snippet from Gordon et al. is translated by us to a functional CHURCH program to clarify its semantics. The generative model is contained in the CHURCH function "take-a-sample". The number of samples taken was set to 70000 in this run. This number could in principle be increased to get a better precision of estimates but WebCHURCH sends the message 'stack overflow'. The sampling method used is the simple-to-understand 'forward sampling'. The screen-shot presented was generated by using the PlaySpace environment of WebCHURCH.

For interpretation of results we present the modified domains of the random variables. These are due to (un-)intentional modifications of Gordon et al., 2014.

Here is a summary of the modified domains due to Gordon et al. (2014):

Val(D) = <d0, d1> = <easy, hard>

Val(I) = <i0, i1> = <non smart, smart>

Val(G) = <**g0, g1**> = <A, **B+C**> = <excellent, **good+average**>

Val(S) = <s0, s1> = <low score, high score>

Val(L) = <l0, l1> = <**strong_letter, weak_letter**>