# Introduction to building Bayesian networks

Laura Uusitalo, May 2006, May 2007
laura.uusitalo@iki.fi

This page is a progressive tutorial on developing Bayesian networks, especially with the Hugin software. Comments are very welcome. Please note, however, that this page is intended to support lectures, not to serve as self-access study material.

This tutorial introduces the use of Hugin software. A restricted (both in functionality and user rights) demo version can be downloaded here.

## Introduction

Bayesian networks are fully probabilistic models that consist of variables and probabilistic links between the variables. Each of the variables has a probability distribution describing our degree of belief in the possible values the variable can have. Probabilistic links between the variables are described as conditional probability distributions. They denote what we believe about the probabilities of the different values of a variable, given that we know the value of another variable that affects it. Charniak (1991) (pdf file) gives an excellent introduction to Bayesian networks, and Jensen (2001) a good introduction that goes well with the Hugin software. It is recommended to read one of these first, since this presentation in its current form doesn't get into the fundamentals of Bayesian networks but is, rather, a practical guide to building them. Uusitalo (2007) may be a useful guide to the practical side of BN modelling as well.

In the Hugin software (and in most other software packages as well), a variable is denoted by an oval, and links between the variables by arrows (often called arcs). If the value of variable B depends on the value of variable A, there should be an arc from A to B.

## In-built distributions and functions

In the basic mode, the (conditional) probability distributions are defined manually; that is, the user explicitly states his/her degree of belief in each of the values the variable can have. However, if the values follow certain parametric distributions, or if they can be calculated from some other values, these tasks can be automatised in Hugin using in-built distributions and functions.

### Node types

There are four different node types in Hugin:

• Numbered
  • exact, discrete values
  • don't need to be integers
• Interval
  • consists of intervals on a continuous scale
  • starting and end points can be freely chosen, but the intervals have to cover all of the values in between
  • the break-points don't need to be integers
  • positive and negative infinity can be used as the endpoints (inf and -inf)
• Boolean
  • has two states: false and true (cannot be renamed)
• Labelled
  • each state has a label, and the labels don't (necessarily) have any numerical interpretation

Numbered and interval types are referred to as "numeric". If parametric probability distributions or arithmetic calculations are applied, the node type needs to be numeric, and in cases where the values can be other than integers, the node type must be interval.

TRY:

Double-click the node so that a window pops up. Choose the Node tab (should be the one on the top), and use the drop-down selection of Type.

OR

Ctrl-click the node to see the Tables pane; then choose Functions → Set type

### Functions and operators

There is a collection of built-in arithmetic operations and statistical distributions in Hugin that can considerably help in creating the conditional probability tables (CPT). The tables naturally need to be numeric, and have to be able to handle all of the possible outcomes of the expressions or distributions. The operators and functions available include:

• mathematical operators and functions such as:
  • subtraction
  • multiplication
  • logarithm
  • exponential
  • sine
• comparison operators such as:
  • equals
  • not equals
  • less than
• If-Then-Else
  • if(condition, if_true, if_false)
• logical operators
  • and
  • or
  • not

For a comprehensive list of operators and functions with examples, see Operators and functions.

TRY:

First, ctrl-click the node to see the Tables pane. Then choose:

Functions → Expressions → Switch to expressions; then:

Functions → Expressions → Build expression → Arithmetic operations or

Functions → Expressions → Build expression → If-then-else

TRY:

Build a model in which you have the number of men and women, and define the number of people as men + women. Note that the maximum value of people has to be equal to or higher than max(men) + max(women).
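The table that an expression such as men + women produces can be sketched in plain Python. This is only an illustration of the idea, not Hugin code; the states and prior probabilities below are made-up assumptions, and men and women are assumed to be independent.

```python
from itertools import product

# Hypothetical numbered states with assumed prior probabilities.
men = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
women = {0: 0.5, 1: 0.5}

# The child node needs a state for every possible sum,
# i.e. all values up to max(men) + max(women).
people_states = range(max(men) + max(women) + 1)

# Distribution of people = men + women (men and women independent).
people = {s: 0.0 for s in people_states}
for (m, pm), (w, pw) in product(men.items(), women.items()):
    people[m + w] += pm * pw

print(people)
```

If the state list of people stopped below max(men) + max(women), some sums would have no state to fall into, which is why the node has to cover the whole range.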

Build a model in which the recruitment depends on the spawning stock size (SSB) and on the environmental quality in that year in the following manner: if the environmental quality is bad, then the recruitment will be just 70% of (0.7 times) the SSB; otherwise, the recruitment will be 300% of (3 times) the SSB.
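The logic of the recruitment exercise can be written out as a small Python function. This is a sketch only: the state name "bad" and the SSB values are assumptions, and in Hugin the same rule would be entered as an if(condition, if_true, if_false) expression in the node table.

```python
def recruitment(ssb: float, environment: str) -> float:
    """Recruitment is 0.7 * SSB in a bad year and 3 * SSB otherwise."""
    return 0.7 * ssb if environment == "bad" else 3.0 * ssb

# Hypothetical SSB values (arbitrary units) and environment states.
for ssb in (100.0, 500.0):
    for env in ("bad", "good"):
        print(ssb, env, recruitment(ssb, env))
```

As in the previous exercise, the recruitment node has to be discretized so that it covers every possible result, here up to 3 times the largest SSB value.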

### Distributions

You can also use in-built statistical distributions to define your probability tables. The program will compute the correct probabilities for each bin given the parameters of the distribution. Note that if you use a distribution that goes to infinity, such as the Gaussian (normal) distribution, you'll need to define your bins so that they go to infinity as well. The symbol for infinity in Hugin is inf; for example, a bin from one thousand to infinity would be 1000 - inf.

Note also that the node type needs to be defined correctly. In order to be able to apply any continuous distributions, the node type needs to be Interval, and to apply discrete distributions, Numbered or Interval.
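To see what "computing the probabilities for each bin" means, here is a minimal Python sketch for a normal distribution. The break-points, mean and standard deviation are assumed values chosen for illustration; the calculation shows the principle and is not taken from Hugin.

```python
import math

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of Normal(mean, sd)."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Hypothetical break-points; the outer bins extend to -inf and inf.
breaks = [-math.inf, 500.0, 1000.0, math.inf]
mean, sd = 800.0, 200.0  # assumed parameters

# Probability of a bin = CDF at its upper end minus CDF at its lower end.
bin_probs = [
    normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
    for lo, hi in zip(breaks, breaks[1:])
]
print(bin_probs)
```

Because the outer bins reach infinity, the bin probabilities sum to one; with finite outer break-points some probability mass would be left without a bin.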

The currently available distributions include:

• Continuous:
  • Normal
  • Beta
  • Gamma
  • Exponential
  • Weibull
  • Uniform
• Discrete:
  • Binomial
  • Poisson
  • Negative Binomial
  • Geometric

TRY:

First, ctrl-click the node to see the Tables pane. Then choose:

• Functions → Expressions → Switch to expressions
• Functions → Expressions → Build expression → Discrete distributions OR Continuous distributions

### Model nodes

Model nodes make it possible to split an if expression into several expressions: instead of stating if(Letter==A, x1, if(Letter==B, x2, if( ))), one can give a separate expression for each of the letters. The computation is no different from that of the complex if expression, but it is perhaps easier to read and write.
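The difference is one of notation only, which a small Python analogy may make clearer. The letters and the three expressions below are hypothetical; the point is that one nested conditional and a set of per-state expressions give the same result.

```python
# Hypothetical per-state expressions, one for each state of the model node.
expressions = {
    "A": lambda x: x + 1.0,
    "B": lambda x: 2.0 * x,
    "C": lambda x: x ** 2,
}

def nested_if(letter: str, x: float) -> float:
    """The same logic written as a single nested if expression."""
    return x + 1.0 if letter == "A" else (2.0 * x if letter == "B" else x ** 2)

def with_model_node(letter: str, x: float) -> float:
    """Pick the separate expression that belongs to the given state."""
    return expressions[letter](x)

for letter in "ABC":
    print(letter, nested_if(letter, 3.0), with_model_node(letter, 3.0))
```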

TRY:
• See house_prices.net
  • utilises model nodes, in-built distributions and mathematical calculation

TRY:

Look at fishstock.net

• describes a very simple fish stock dynamics model: number of eggs is dependent on environment (temperature, predation...), and the spawning stock size (SSB)
• This year's eggs are the next year's recruits
• fishing takes place after spawning; SSB2 is the spawning stock size after the fishing season
• the variables have labeled values like low, moderate, high, and so on

Now change the values into something more reasonable:

• Change the variables into interval type and give them some reasonable discretization (feel free to use several bins)
• For practical reasons, give SSB and SSB2 the same discretization, and likewise recruitment and next_recruitment
• Create the probability tables for Environment, SSB, Fishing, and recruitment with the help of the in-built distributions
• Create the conditional distributions for Eggs, next_recruitment, and SSB2 using the in-built distributions / arithmetic functions

We'll get back to this example later.

## Decision and utility variables

Bayesian networks can be enhanced by adding decision and utility nodes, thus turning them into influence diagrams. A decision node describes a decision, i.e. something that can and will be decided by the manager of the system. Decisions such as whether to close the fishery or not, how high a total allowable catch (TAC) to allow, or how much money to invest in fisheries research would be natural to model with decision nodes.

In a decision node, you state all the decision options, which must be mutually exclusive. Probabilities are not assigned to these.

One network can contain several decision variables. In that case, however, there must be a path through all the decision nodes, stating the order in which the decisions are made.

Utility nodes describe the utilities associated with all the possible outcomes. The most straightforward way is to describe the utilities as money, but any other measure of utility can be used as well. The only requirement is that the utilities described in one model are commensurate with each other.

Utilities can be either positive or negative; for example, costs can be described as negative utilities and profits as positive utilities. Also, positive utilities could be assigned to fish yields and negative to the collapse of the stock.

Once the decision options and the utilities have been specified, the model can be run. Hugin will then state the expected utility of each of the decision options, and the option with the highest expected utility is the optimal one. All information entered into the network will be taken into account here, and the recommendations can change when more information is acquired.
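The expected utility calculation can be sketched by hand for a toy case. All numbers, state names and decision options below are made up for illustration, and the code only shows the arithmetic, not how Hugin evaluates an influence diagram.

```python
# P(SSB2 state | decision); hypothetical numbers, each row sums to 1.
p_ssb2 = {
    "low TAC":  {"low": 0.05, "moderate": 0.35, "high": 0.60},
    "high TAC": {"low": 0.40, "moderate": 0.40, "high": 0.20},
}
# Utilities: profit of each decision, and a cost if the stock ends up low.
profit = {"low TAC": 50.0, "high TAC": 100.0}
stock_cost = {"low": -200.0, "moderate": 0.0, "high": 0.0}

# Expected utility = profit + probability-weighted sum of the stock costs.
expected_utility = {
    d: profit[d] + sum(p * stock_cost[s] for s, p in p_ssb2[d].items())
    for d in p_ssb2
}
best = max(expected_utility, key=expected_utility.get)
print(expected_utility, best)
```

With these assumed numbers the higher profit of the high TAC is outweighed by the risk of the stock dropping to "low". Entering evidence would change the probabilities in p_ssb2 and could change the preferred decision.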

TRY:
• Rectangular nodes in the model building window are decision nodes
• Diamond-shaped nodes are the Utility nodes

Take a look at the utility model. A management decision about the Total Allowable Catch (TAC) has been added to the previous fish stock model, and the profits from the fishery have been defined conditionally given the level of fishing. Also, a cost function has been introduced: if SSB2 drops to the class "low", this causes costs (for example, through increased assessment costs). So, we wish to be able to fish heavily, but do not wish to risk the stock size dropping to "low".

See if, and how, the preferred decision changes if you know that SSB is a) high b) moderate c) low.

• Add decision and utility variables to your own, modified fish stock model
• Examine the behaviour of your model. Is it reasonable in the light of your assumptions?

## Instance nodes

Object-oriented, or hierarchical, Hugin networks can contain instance nodes in addition to ordinary nodes. An instance node is a node representing an instance of another network; i.e. it is a representation of a subnetwork. The networks of which instances exist in other networks can also contain instance nodes themselves, so a hierarchical representation of a domain can be built using instance nodes. Using instance nodes often makes the network a lot less cluttered and helps in communicating the ideas behind the network. Instance nodes can also be useful if the same pattern is repeated in the model several times.

Building hierarchical networks requires care. An instance node should be thought of as a copy of the original subnetwork. It connects to other nodes via some of the (basic) nodes in the subnetwork; these nodes are called interface nodes. There are two types of interface nodes: output and input.

Output nodes are visible also outside of the instance, and they can be used as parents of other nodes in the main network. All of the variables that need to be parents of other variables in the main network must be defined as output nodes.

Input nodes are best thought of as placeholders for variables outside the subnetwork that are parents of variables in the subnetwork. The structure of the probability table of such a placeholder must be identical to that of the actual node in the main network (i.e. the discretization or labels have to be identical, but the probability distributions don't). When the input node (the placeholder node) is connected to the node outside the subnetwork, the probability distributions become identical, i.e. the input node will just reflect the values of the actual variable in the main network. If the input node isn't connected to a node outside the subnetwork, its default distribution will be used, and that's why input nodes need to have a probability distribution of their own.

TRY:

Transform the utility model (previous example) into an instance node by defining SSB, recruitment and TAC as input nodes and next_recruitment and SSB2 as output nodes.

• Double-click the node to get the node properties pop-up box
• Check the "input" or "output" box on the "Node" tab (should be the one on the top)

Note that only variables without parents can be input nodes.

TRY:

Make a multi-year model using your utility model as the model of one year: create several instances of it with the instance node tool and link them together.

• SSB2 is the SSB at the end of year 1 → it's equal to the SSB at the beginning of year 2 → SSB2 of year 1 can be directly copied into the SSB of year 2, etc.
  • We can define SSB as an input node and SSB2 as an output node
  • In order to do this, the possible values, i.e. the discrete bins, need to be identical
• next_recruitment of year 1 is, similarly, the recruitment of year 2
  • next_recruitment is an output node, recruitment is an input node
• There must be a direct path through all decision nodes. The easiest way to do this is to turn the TAC into an input node and then create the yearly TAC decisions in the multi-year model, linking them to the TACs in the submodels. A direct path can then be drawn through the TAC decisions in the multi-year model, indicating the order in which the decisions are made.
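The idea of chaining time slices, where the SSB2 of one year becomes the SSB of the next, can be sketched as follows. This is a strongly simplified illustration: it leaves out recruitment and the TAC decisions, and the state names and transition probabilities are invented.

```python
states = ["low", "moderate", "high"]

# P(SSB2 | SSB) for one year slice; hypothetical numbers, rows sum to 1.
one_year = {
    "low":      {"low": 0.7, "moderate": 0.3, "high": 0.0},
    "moderate": {"low": 0.2, "moderate": 0.6, "high": 0.2},
    "high":     {"low": 0.0, "moderate": 0.3, "high": 0.7},
}

def run_year(ssb: dict) -> dict:
    """Marginal distribution of SSB2 given the distribution of SSB."""
    return {
        s2: sum(ssb[s] * one_year[s][s2] for s in states)
        for s2 in states
    }

ssb = {"low": 0.0, "moderate": 0.0, "high": 1.0}  # assumed starting state
for year in range(1, 4):
    ssb = run_year(ssb)  # SSB2 of this year becomes the SSB of the next year
    print(year, ssb)
```

The hand-over in the loop only works because SSB and SSB2 use exactly the same states, which is the reason the bins of the input and output nodes have to match.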

You can take a look at utility.oobn, a one-year subnetwork with the input and output nodes defined, and utility_multi.oobn, a multi-year object-oriented BN which uses the utility model as one time slice.

TECHNICAL DETAILS:
• Instance nodes are suitable if and only if we wish to make multiple identical copies of a network; not only the structure but also the parameters must be the same
• The networks of which instances are made must be open, and they can't be closed as long as the object-oriented network is open
• Drawback: you can't (apparently) perform EM learning on object-oriented networks

## Learning from data with Hugin

### Discretizing data

For the purposes of modelling with Hugin, continuous variables need to be discretized, since only very limited computational methods exist for dealing with continuous variables in Bayesian networks.

The discretization of the data needs to be considered carefully. Things to take into account include the number of bins (discrete classes) and the break-points of the bins. On the one hand, the more bins there are, the more detailed the inference that can be drawn; on the other hand, the number of conditional probabilities grows quickly as the number of bins increases, because the size of a conditional probability table is the product of the numbers of states of the node and of all its parents. This also means that the available data has to be divided among more and more table entries.
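How quickly the tables grow is easy to check with a few lines of Python. The helper function cpt_size is a name introduced here for illustration: it multiplies the number of states of a node by the numbers of states of its parents.

```python
from math import prod

def cpt_size(n_states: int, parent_states: list[int]) -> int:
    """Number of probabilities in the table of a node with the given parents."""
    return n_states * prod(parent_states)

for bins in (3, 5, 10):
    # A node with two parents, all three variables having `bins` states.
    print(bins, cpt_size(bins, [bins, bins]))
```

Going from 3 to 10 bins per variable increases the table of a node with two parents from 27 to 1000 entries, all of which have to be estimated from the same data set.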

It's good to consider the break-points of the bins carefully as well. The ranges of values belonging to one class need to be scientifically meaningful. For example, it may be reasonable to set the bin break-points of nutrient concentration so that they reflect the commonly accepted limits for oligo-, meso-, and eutrophic waters.

How to do the discretization in practice? Hugin has a discretization engine within the Learning Wizard; the original, continuous data can be given to Hugin and the discretization can be done there. This works well, but the user interface is not very handy, and this makes the discretization process tedious.

There is a piece of software, discretize.exe, which can be downloaded here. It is a useful tool for discretizing the data before inputting it to Hugin. Any script that does the same is, of course, fine.
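As an example of such a script, here is a minimal Python sketch that maps values to bin labels. It is not the Discretize program; the break-points, the data, the "lo - hi" label format and the choice of left-closed bins are all assumptions that should be adapted to your own data and to what Hugin expects.

```python
from bisect import bisect_right

def discretize(value: float, breaks: list[float]) -> str:
    """Return the label 'lo - hi' of the bin that contains value.

    Bins are closed on the left and open on the right, and the
    outermost bins extend to -inf and inf.
    """
    edges = ["-inf"] + [str(b) for b in breaks] + ["inf"]
    i = bisect_right(breaks, value)
    return f"{edges[i]} - {edges[i + 1]}"

# Hypothetical break-points and measurements.
breaks = [10.0, 20.0, 50.0]
data = [3.2, 10.0, 19.9, 75.0]
print([discretize(x, breaks) for x in data])
```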

TRY:
• Try the Hugin discretizing engine (within the Learning Wizard; Wizards → Learning Wizard)
• Try the Discretize software (instructions in the readme file)

You can either use your own data (recommended) or the NO2 data. (Explanation of the NO2 data.)

A discretized version of the data

### Structural and EM learning

Learning model structures from data is an area of active research, and although the statistical theory is well understood, the methods are still under development, since their computational requirements are heavy (Jensen, 2001 p. 81; Myllymäki et al., 2002). Finding the optimal model structure is computationally very hard (Myllymäki et al., 2002), and approximation methods are generally used instead.

Hugin uses so-called constraint-based learning. Constraint-based algorithms search for conditional dependences between each pair of variables, and build the model structure based on them (Steck and Tresp, 1999). Constraint-based learning requires no prior knowledge or input from the user.

Once the model structure (i.e. which variables are linked, and what the direction of the arcs is) is established, the conditional probabilities of the model can be estimated from data using an Expectation-Maximization (EM) algorithm (Spiegelhalter et al. 1993, Lauritzen 1995). It requires only the model structure to be known beforehand, and iteratively calculates maximum likelihood estimates for the parameters given the data and the model structure.

Unlike many estimation methods, EM algorithms can handle situations with missing observations, whether the data is missing randomly or the absence of an observation is dependent on the states of other variables (Heckerman, 1995). The distributions for the incomplete data can be approximated using Dirichlet distributions.
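The way EM copes with missing observations can be illustrated with the smallest possible network, A → B with two binary variables. The sketch below is a hand-written toy version of the principle, not Hugin's implementation; the data set and the starting values are invented, and None marks a missing value of A.

```python
# Records (a, b) for the model A -> B; None marks a missing observation of A.
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1),
        (None, 1), (None, 1), (None, 0)]

p_a = 0.5                # initial guess for P(A=1)
p_b = {0: 0.4, 1: 0.6}   # initial guess for P(B=1 | A=a)

def likelihood_b(b: int, a: int) -> float:
    """P(B=b | A=a) under the current parameter estimates."""
    return p_b[a] if b == 1 else 1.0 - p_b[a]

for _step in range(50):
    # E-step: expected membership of each record in the class A=1.
    w1 = []
    for a, b in data:
        if a is None:
            num = p_a * likelihood_b(b, 1)
            den = num + (1.0 - p_a) * likelihood_b(b, 0)
            w1.append(num / den)
        else:
            w1.append(float(a))
    # M-step: re-estimate the parameters from the expected counts.
    n1 = sum(w1)
    n0 = len(data) - n1
    p_a = n1 / len(data)
    p_b = {
        1: sum(w * b for w, (_a, b) in zip(w1, data)) / n1,
        0: sum((1.0 - w) * b for w, (_a, b) in zip(w1, data)) / n0,
    }

print(p_a, p_b)
```

Records with a missing A are not thrown away: in each iteration they are split between A=0 and A=1 according to the current estimates, and the split is then used to update the estimates.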

TRY:

Try learning the model structure and the parameters (the probability distributions) using the Hugin Learning Wizard. Play with the options. You can use the discretized data you created with Discretize. Does the model structure seem sensible?

TECHNICAL HINT:

If you want to create your own model structure, but use your data to learn the probabilities, the most practical way of doing this is to input the data to Hugin through the Learning Wizard and go through the dialogue. In the EM learning part, check the "Skip EM learning" box. After this, remove the arcs created by the Learning Wizard and create your own model structure. When you are happy with your model structure, use EM learning (the EM button in the model window) separately to learn the parameters of your model.

## Short introduction to d-separation

An integral property of Bayesian networks is d-separation: the fact that, under certain conditions, information on one variable does not update our beliefs about other variables.

Consider the following situation (examples from Jensen, 2001):

Figure 1. Serial connection.

Obviously, evidence on A will influence the certainty of B, which then will affect the certainty of C, and vice versa; evidence on C will influence A through B. On the other hand, if the state of B is known, the channel is blocked, and A and C become independent. We say that A and C are d-separated given B, and when the state of B is known, we say that it is instantiated. The connection in Figure 1 is called a serial connection.
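The blocking of a serial connection can be verified numerically. The sketch below assumes the connection is A → B → C with binary variables and uses invented probability tables; it compares the probability of C with and without knowledge of B.

```python
from itertools import product

# Hypothetical tables for a serial connection A -> B -> C (binary variables).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_c_given_b[b][c]

# Joint distribution obtained with the chain rule.
joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}

def prob_c1(**evidence: int) -> float:
    """P(C=1 | evidence), with evidence given as a=..., b=... keywords."""
    def matches(a: int, b: int) -> bool:
        return evidence.get("a", a) == a and evidence.get("b", b) == b
    num = sum(p for (a, b, c), p in joint.items() if c == 1 and matches(a, b))
    den = sum(p for (a, b, c), p in joint.items() if matches(a, b))
    return num / den

print(prob_c1(a=0), prob_c1(a=1))            # differ: A influences C
print(prob_c1(a=0, b=1), prob_c1(a=1, b=1))  # equal: B blocks the channel
```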

The connection in Figure 2 is called a diverging connection. Influence can pass between the children of A unless the state of A is known; B and C are d-separated given A.

Figure 2. Diverging connection.

As an example of the diverging connection, think of the following situation: the gender of a person (A) affects (in statistical terms) both the height of the person (B) and the length of his/her hair (C). If you see a person sitting, and from behind, you can infer something about his/her gender from his/her haircut, and this, in turn, updates your belief about the height of the person. On the other hand, if you know the gender of the person, knowledge about the haircut doesn't change your estimate of his/her height.

The connection between the variables in Figure 3 is called converging, and this situation requires a little more care. If nothing more is known about A than what can be inferred from its parents, then the parents are independent: evidence on one of them doesn't update the probabilities of the others. Knowledge of one possible cause of an event doesn't tell us anything about other possible causes. However, if we know anything about the consequences, then information on one possible cause can tell us something about the other possible causes. This is the explaining away effect.

Figure 3. Converging connection.

The explaining away effect can be illustrated by an example: if Jack is late for work in the morning (A), it might be caused by him oversleeping (B) or his bus being late (C). We know the probabilities of Jack oversleeping and of his bus being late, and based on them we can calculate the probability that Jack is late for work. The event of Jack's bus being late doesn't affect the probability that Jack oversleeps, or vice versa. Now, if we know that Jack is late, the probabilities of both him oversleeping and the bus being late increase. However, if we find out that Jack has actually overslept, the probability of the bus being late falls back towards its original "background" level. Jack being late is "explained away" by him oversleeping, and we don't need the hypothesis of the bus being late to explain the consequence anymore. In other words, our knowledge about the common consequence of oversleeping and the bus being late links the causes so that knowledge about one affects the probability of the other, even though this does not happen if we don't know anything about the common consequence of these events.
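The Jack example can be worked through numerically. All probabilities in the sketch below are invented for illustration; the code enumerates the joint distribution and reports the probability of the bus being late under different evidence.

```python
from itertools import product

# Hypothetical numbers for the example.
p_b = {0: 0.9, 1: 0.1}   # B: Jack oversleeps
p_c = {0: 0.8, 1: 0.2}   # C: the bus is late
# P(A=1 | B=b, C=c): Jack is late for work, keyed by (b, c).
p_late = {(0, 0): 0.05, (1, 0): 0.9, (0, 1): 0.8, (1, 1): 0.95}

def joint(a: int, b: int, c: int) -> float:
    """P(A=a, B=b, C=c) for the converging connection B -> A <- C."""
    pa = p_late[(b, c)] if a == 1 else 1.0 - p_late[(b, c)]
    return p_b[b] * p_c[c] * pa

def prob_bus_late(**evidence: int) -> float:
    """P(C=1 | evidence), with evidence given as a=..., b=... keywords."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if evidence.get("a", a) != a or evidence.get("b", b) != b:
            continue
        den += joint(a, b, c)
        if c == 1:
            num += joint(a, b, c)
    return num / den

print(prob_bus_late())            # background probability of a late bus
print(prob_bus_late(b=1))         # oversleeping alone tells nothing about the bus
print(prob_bus_late(a=1))         # Jack is late: a late bus becomes more likely
print(prob_bus_late(a=1, b=1))    # he overslept: the late bus is explained away
```

With these assumed numbers, learning that Jack is late raises the probability of a late bus well above its background value, and learning that he overslept brings it back down close to the background value.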

Notice that in the case of a converging connection, the common child of the variables need not be instantiated; it's enough that it has received some evidence, even if its state is not exactly known. This means that the variables B and C become d-connected once either A or one of A's descendants receives evidence.

The Markov blanket of variable A is the set of nodes consisting of the parents of A, the children of A, and the variables sharing a child with A. If all variables in the Markov blanket of A are instantiated, then A is d-separated from the rest of the network.
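Finding the Markov blanket from the network structure is mechanical, as the following Python sketch shows. The small network and its node names are hypothetical; the structure is stored as a mapping from each node to the list of its parents.

```python
# Hypothetical network given as node -> list of its parents.
parents = {
    "A": ["P1", "P2"],
    "X": ["A", "S"],   # S shares the child X with A
    "Y": ["A"],
    "Z": ["Y"],        # Z is a grandchild of A, outside the blanket
    "P1": [], "P2": [], "S": [],
}

def markov_blanket(node: str) -> set[str]:
    """Parents, children, and the other parents of the children of node."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}

print(sorted(markov_blanket("A")))
```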

The properties of d-separation are used in the analysis of the networks. They may also aid in designing the model: posing questions like "If I know the value of A, information of B cannot influence C at all. Is this the way things should be?" may help in determining the correct structure for the model.

In practice, it is often advisable to build the models so that the arrows follow the direction of natural causal connections (even though causality is not, strictly speaking, necessary in Bayesian networks). If causality is followed in the model structure, the correct d-separation properties usually follow naturally as well. The questions to ask are "What do these values depend on?" and "Which variables have an effect on the values of this one?" For example, it is easy to see that the catch depends at least on the stock abundance and the fishing effort; hence, there should be arrows from effort and stock size to catch. In particular, try to avoid thinking in terms of "flow of information"; i.e. do not draw the arrows thinking "the catch gives me information about the stock size so there should be an arrow from catch to stock size", since this will most often lead to an incorrect model structure.

## References

• Charniak, E. 1991. Bayesian networks without tears. AI Magazine 12(4): 50-63.
• Heckerman, D. 1995. A tutorial on learning with Bayesian networks. Technical report MSR-TR-95-06, Microsoft Research.
• Jensen, F.V. 2001. Bayesian networks and decision graphs. Springer-Verlag, New York. ISBN 0-387-95259-4.
• Lauritzen, S.L. 1995. The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis 19: 191-201.
• Myllymäki, P., Silander, T., Tirri, H., and Uronen, P. 2002. B-Course: A web-based tool for Bayesian and causal data analysis. International Journal on Artificial Intelligence Tools 11(3): 369-387.
• Spiegelhalter, D.J., Dawid, A.P., Lauritzen, S.L., and Cowell, R.G. 1993. Bayesian analysis in expert systems. Statistical Science 8(3): 219-247.
• Steck, H., and Tresp, V. 1999. Bayesian Belief Networks for Data Mining. Proceedings of the 2nd Workshop on Data Mining und Data Warehousing als Grundlage Moderner Entscheidungsunterstützender Systeme, DWDW99, Sammelband, Universität Magdeburg, September 1999.
• Uusitalo, L. 2007. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling 203(3-4): 312-318.