
Nicholson & Korb

Bayesian AI

AI'99, Sydney
6 December 1999

Ann E. Nicholson and Kevin B. Korb
School of Computer Science and Software Engineering
Monash University, Clayton, VIC 3168, AUSTRALIA
{ann,korb}@csse.monash.edu.au

Overview

1. Introduction to Bayesian AI (20 min)
2. Bayesian networks (50 min)
   Break (10 min)
3. Applications (50 min)
   Break (10 min)
4. Learning Bayesian networks (50 min)
5. Current research issues (10 min)
6. Bayesian Net Lab (60 min: Optional)
7. Dinner (Optional)

Bayesian AI Tutorial

Introduction to Bayesian AI

• Reasoning under uncertainty
• Probabilities
• Alternative formalisms
  – Fuzzy logic
  – MYCIN's certainty factors
  – Default Logic
• Bayesian philosophy
  – Dutch book arguments
  – Bayes' Theorem
  – Conditionalization
  – Confirmation theory
• Bayesian decision theory
• Towards a Bayesian AI

Reasoning under Uncertainty

Uncertainty: the quality or state of being not clearly known. This encompasses most of what we understand about the world, and most of what we would like our AI systems to understand. It distinguishes deductive knowledge (e.g., mathematics) from inductive belief (e.g., science).

Sources of uncertainty:

• Ignorance (which side of this coin is up?)
• Physical randomness (which side of this coin will land up?)
• Vagueness (which tribe am I closest to genetically? Picts? Angles? Saxons? Celts?)


Probabilities

The classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).

Kolmogorov's Axioms:

1. P(U) = 1
2. ∀X ⊆ U, P(X) ≥ 0
3. ∀X, Y ⊆ U, if X ∩ Y = ∅ then P(X ∨ Y) = P(X) + P(Y)

Conditional probability: P(X|Y) = P(X ∧ Y) / P(Y)

Independence: X ⊥ Y iff P(X|Y) = P(X)

Fuzzy Logic

Designed to cope with vagueness: is Fido a Labrador or a Shepherd?

Fuzzy set theory: m(Fido ∈ Labrador) = m(Fido ∈ Shepherd) = 0.5

Extended to fuzzy logic, which takes intermediate truth values: T(Labrador(Fido)) = 0.5.

Combination rules:

• T(p ∧ q) = min(T(p), T(q))
• T(p ∨ q) = max(T(p), T(q))
• T(¬p) = 1 − T(p)

Not suitable for coping with randomness or ignorance. Obviously not:

    Uncertainty(inclement weather) = max(Uncertainty(rain), Uncertainty(hail), ...)
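The fuzzy combination rules above are easy to state in code. The sketch below uses invented rain/hail numbers (they appear nowhere in the slides) to illustrate the objection: for random events, the max rule understates the chance of the disjunction compared with the probabilistic calculation.

```python
# Fuzzy truth-value combination rules from the slide.
def t_and(tp, tq):
    return min(tp, tq)        # T(p AND q)

def t_or(tp, tq):
    return max(tp, tq)        # T(p OR q)

def t_not(tp):
    return 1.0 - tp           # T(NOT p)

# Hypothetical chances of rain and hail on some day (illustrative only):
p_rain, p_hail = 0.3, 0.2

# Max rule treats "inclement weather" as no more likely than rain alone.
fuzzy_inclement = t_or(p_rain, p_hail)

# Probability theory (assuming independence) gives a strictly larger value:
prob_inclement = p_rain + p_hail - p_rain * p_hail
```

With these numbers the max rule yields 0.3 while the probabilistic disjunction yields 0.44, which is the point of the slide's closing remark.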

MYCIN's Certainty Factors

An uncertainty formalism developed for the early expert system MYCIN (Buchanan and Shortliffe, 1984). Elicit for (h, e):

• measure of belief: MB(h, e) ∈ [0, 1]
• measure of disbelief: MD(h, e) ∈ [0, 1]

    CF(h, e) = MB(h, e) − MD(h, e) ∈ [−1, 1]

Special functions are provided for combining evidence.

Problems:

• No semantics was ever given for 'belief'/'disbelief'.
• Heckerman (1986) proved that the restrictions required for a probabilistic semantics imply absurd independence assumptions.

Default Logic

Intended to reflect "stereotypical" reasoning under uncertainty (Reiter 1980). Example:

    Bird(Tweety) : Bird(x) → Flies(x)
    ---------------------------------
    Flies(Tweety)

Problems:

• The best semantics for default rules are probabilistic (Pearl 1988, Korb 1995).
• Mishandles combinations of low-probability events. E.g.,

    ApplyForJob(me) : ApplyForJob(x) → Reject(x)
    --------------------------------------------
    Reject(me)

I.e., the dole always looks better than applying for a job!
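The slide mentions "special functions provided for combining evidence" without stating them. The parallel-combination rule usually attributed to EMYCIN is sketched below; treat the exact form as an assumption drawn from secondary accounts of MYCIN rather than from this tutorial.

```python
def cf_combine(cf1, cf2):
    """Combine two certainty factors for the same hypothesis
    (the commonly cited EMYCIN parallel-combination rule)."""
    if cf1 >= 0 and cf2 >= 0:
        # Both supportive: second factor fills part of the remaining gap.
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        # Both disconfirming: symmetric to the positive case.
        return cf1 + cf2 * (1 + cf1)
    # Mixed evidence: normalize by the smaller magnitude.
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

combined_pos = cf_combine(0.6, 0.5)    # two supportive pieces of evidence
combined_neg = cf_combine(-0.6, -0.5)  # two disconfirming pieces
combined_mix = cf_combine(0.6, -0.5)   # conflicting evidence
```

Heckerman's criticism applies precisely to rules of this shape: the combination is order-independent only under strong implicit independence assumptions.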


Probability Theory

So, why not use probability theory to represent uncertainty? That's what it was invented for... dealing with physical randomness and degrees of ignorance. Furthermore, if you make bets which violate probability theory, you are subject to Dutch books:

A Dutch book is a sequence of "fair" bets which collectively guarantee a loss.

Fair bets are bets based upon the standard odds-probability relation:

    O(h) = P(h) / (1 − P(h))
    P(h) = O(h) / (1 + O(h))

A Dutch Book

Payoff table on a bet for h (Odds = p/(1 − p); S = betting unit):

    h | Payoff
    T | $(1 − p) × S
    F | −$p × S

Given a fair bet, the expected value from such a payoff is always $0.

Now, let's violate the probability axioms. Example: say P(A) = −0.1 (violating A2). Payoff table against A (the inverse of the table for A), with S = 1:

    ¬A | Payoff
    T  | $pS = −$0.10
    F  | −$(1 − p)S = −$1.10

Either way, the bettor is guaranteed a loss.
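The payoff table can be checked mechanically. The sketch below encodes the fair-bet payoffs from the slide and shows that a coherent probability gives expected value zero, while the incoherent P(A) = −0.1 loses on both outcomes of the bet against A:

```python
def payoff_for(p, outcome, S=1.0):
    """Payoff on a unit bet *for* h at price p: win (1-p)S if h, lose pS if not."""
    return (1 - p) * S if outcome else -p * S

def payoff_against(p, a_outcome, S=1.0):
    """Betting against A is betting for (not A) at price 1 - p."""
    return payoff_for(1 - p, not a_outcome, S)

# Coherent case: at a fair price (here p = 0.4) expected value is $0.
ev = 0.4 * payoff_for(0.4, True) + 0.6 * payoff_for(0.4, False)

# Incoherent case from the slide: P(A) = -0.1 violates axiom A2.
p = -0.1
loss_if_a = payoff_against(p, True)       # -$(1 - p) = -$1.10
loss_if_not_a = payoff_against(p, False)  # $p = -$0.10
```

Both payoffs are negative, so the sequence of "fair" bets guarantees a loss: a Dutch book.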

Bayes' Theorem; Conditionalization

Due to Reverend Thomas Bayes (1764):

    P(h|e) = P(e|h) P(h) / P(e)

Conditionalization: P'(h) = P(h|e)

Or, read Bayes' theorem as:

    Posterior = Likelihood × Prior / Prob of evidence

Assumptions:

1. Joint priors over {h_i} and e exist.
2. Total evidence: e, and only e, is learned.

Bayesian Decision Theory

(Frank Ramsey, 1931)

Decision making under uncertainty: what action to take (plan to adopt) when the future state of the world is not known.

Bayesian answer: find the utility of each possible outcome (action-state pair) and take the action that maximizes expected utility.

Example:

    Action         | Rain (p = .4) | Shine (1 − p = .6)
    Take umbrella  | 30            | 10
    Leave umbrella | −100          | 50

Expected utilities:

    E(Take umbrella) = (30)(.4) + (10)(.6) = 18
    E(Leave umbrella) = (−100)(.4) + (50)(.6) = −10
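The expected-utility calculation in the umbrella table is a one-liner; a minimal sketch, using the table's utilities:

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

p_rain = 0.4
eu_take = expected_utility([(p_rain, 30), (1 - p_rain, 10)])
eu_leave = expected_utility([(p_rain, -100), (1 - p_rain, 50)])

# The Bayesian recommendation is the action with maximal expected utility.
best = "take" if eu_take > eu_leave else "leave"
```

Here taking the umbrella (EU = 18) beats leaving it (EU = −10), so the agent takes it.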


Bayesian AI

A Bayesian conception of an AI is: an autonomous agent which

• has a utility structure (preferences)
• can learn about its world and the relation between its actions and future states (probabilities)
• maximizes its expected utility

The techniques used in learning about the world are (primarily) statistical... hence Bayesian data mining.

Bayesian Networks: Overview

• Syntax
• Semantics
• Evaluation methods
• Influence diagrams (Decision Networks)
• Dynamic Bayesian Networks

Bayesian Networks

• A data structure which represents the dependence between variables;
• gives a concise specification of the joint probability distribution.
• A Bayesian network is a graph in which the following holds:
  1. A set of random variables makes up the nodes in the network.
  2. A set of directed links or arrows connects pairs of nodes.
  3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
  4. The graph is directed and acyclic (a DAG), i.e. it has no directed cycles.

Example: Earthquake

(Pearl, R&N)

• You have a new burglar alarm installed.
• It is reliable about detecting burglary, but also responds to minor earthquakes.
• Two neighbours (John, Mary) promise to call you at work when they hear the alarm.
  – John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also).
  – Mary likes loud music and sometimes misses the alarm!
• Given evidence about who has and hasn't called, estimate the probability of a burglary.


Earthquake Example: Network Structure

[Figure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]

    P(B) = 0.001        P(E) = 0.002

    B E | P(A|B,E)
    T T | 0.95
    T F | 0.94
    F T | 0.29
    F F | 0.001

    A | P(J|A)          A | P(M|A)
    T | 0.90            T | 0.70
    F | 0.05            F | 0.01

Earthquake Example: Notes

• Assumptions: John and Mary don't perceive burglary directly; they do not feel minor earthquakes.
• Note: there is no info about loud music, or about the telephone ringing and confusing John. This is summarised in the uncertainty in the links from Alarm to JohnCalls and MaryCalls.
• Once the topology is specified, we need to specify a conditional probability table (CPT) for each node.
  – Each row contains the conditional probability of each node value for a conditioning case.
  – Each row must sum to 1.
  – A table for a Boolean variable with n Boolean parents contains 2^(n+1) probabilities.
  – A node with no parents has one row (the prior probabilities).

Semantics of Bayesian Networks

• A (more compact) representation of the joint probability distribution.
  – Helpful in understanding how to construct the network.
• An encoding of a collection of conditional independence statements.
  – Helpful in understanding how to design inference procedures.

Representing the Joint Probability Distribution

    P(X1 = x1, X2 = x2, ..., Xn = xn)
      = P(x1, x2, ..., xn)
      = P(x1) × P(x2|x1) × ... × P(xn|x1 ∧ ... ∧ xn−1)
      = Π_i P(xi | x1 ∧ ... ∧ xi−1)
      = Π_i P(xi | π(Xi))

Example:

    P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
      = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
      = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
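The factorization above can be checked directly. A minimal sketch using the CPT values of the Pearl / Russell & Norvig version of the earthquake example (priors P(B) = 0.001, P(E) = 0.002):

```python
# CPTs for the earthquake network, each entry giving P(var = True | parents).
p_b = 0.001                      # P(Burglary)
p_e = 0.002                      # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm)

# P(J, M, A, not-B, not-E) via the factorization  prod_i P(x_i | parents(X_i)):
joint = (p_j[True] * p_m[True] * p_a[(False, False)]
         * (1 - p_b) * (1 - p_e))
```

The product comes to about 0.00063: the alarm going off with neither a burglary nor an earthquake, and both neighbours calling, is a very improbable joint event.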


Network Construction

1. Choose the set of relevant variables Xi that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
   (a) Pick a variable Xi and add a node to the network for it.
   (b) Set π(Xi) to some minimal set of nodes already in the net such that the conditional independence property is satisfied:

       P(Xi | Xi−1, ..., X1) = P(Xi | π(Xi))

   (c) Define the CPT for Xi.

Compactness and Node Ordering

• The compactness of a BN is an example of a locally structured (or sparse) system.
• The correct order in which to add nodes is to add the "root causes" first, then the variables they influence, and so on until the "leaves" are reached.
• Examples of wrong orderings (which still represent the same joint distribution):

1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

[Figure: the denser network produced by this ordering.]

Compactness and Node Ordering (cont.)

2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.

[Figure: the still denser network produced by this ordering.]

More probabilities than the full joint! See below for why.

Conditional Independence: Causal Chains

Causal chains give rise to conditional independence:

    A → B → C

    P(C|A ∧ B) = P(C|B)

Example:

• A = Jack's flu
• B = severe cough
• C = Jill's flu


Conditional Independence: Common Causes

Common causes (or ancestors) also give rise to conditional independence:

    A ← B → C

    P(C|A ∧ B) = P(C|B)

Example:

• A = Jack's flu
• B = Joe's flu
• C = Jill's flu

Conditional Dependence: Common Effects

Common effects (or their descendants) give rise to conditional dependence:

    A → B ← C

    P(A|C ∧ B) ≠ P(A|B)

Example:

• A = flu
• B = severe cough
• C = tuberculosis

Given a severe cough, flu "explains away" tuberculosis.
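Explaining away can be made numeric. The CPT below is invented for illustration (the slide gives no numbers for the flu/cough/TB collider); the qualitative effect does not depend on the particular values:

```python
# Hypothetical parameters for the collider flu -> cough <- TB.
p_flu, p_tb = 0.10, 0.01          # marginally independent causes
p_cough = {(True, True): 0.95, (True, False): 0.80,
           (False, True): 0.90, (False, False): 0.05}  # P(cough | flu, tb)

def joint(flu, tb, cough):
    pf = p_flu if flu else 1 - p_flu
    pt = p_tb if tb else 1 - p_tb
    pc = p_cough[(flu, tb)] if cough else 1 - p_cough[(flu, tb)]
    return pf * pt * pc

# P(tb | cough): the cough raises belief in TB above its 0.01 prior.
num = sum(joint(f, True, True) for f in (True, False))
den = sum(joint(f, t, True) for f in (True, False) for t in (True, False))
p_tb_cough = num / den

# P(tb | cough, flu): once flu is known, it explains the cough,
# and belief in TB drops back toward its prior.
p_tb_cough_flu = joint(True, True, True) / (
    joint(True, True, True) + joint(True, False, True))
```

With these numbers, P(tb | cough) ≈ 0.07 but P(tb | cough, flu) ≈ 0.01: learning about one cause of the shared effect lowers belief in the other.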

D-separation

• A graph-theoretic criterion of conditional independence.
• We can determine whether a set of nodes X is independent of another set Y given a set of evidence nodes E, i.e., X ⊥ Y | E.
• Earthquake example:

[Figure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]

Causal Ordering

Why does variable order affect network density? Because

• using the causal order allows direct representation of conditional independencies, while
• violating the causal order requires new arcs to re-establish conditional independencies.


Causal Ordering (cont'd)

[Figure: Flu and TB are parents of Cough.]

Flu and TB are marginally independent.

Given the ordering Cough, Flu, TB:

[Figure: Cough is a parent of Flu and of TB, with an extra arc between Flu and TB (explaining away).]

The marginal independence of Flu and TB must be re-established by adding Flu → TB or Flu ← TB.

Inference in Bayesian Networks

• The basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables.
• Also called belief updating.
• Types of inference:

[Figure: four patterns of query (Q) and evidence (E) nodes: diagnostic, causal, intercausal (explaining away), and mixed.]

Kinds of Inference

• Diagnostic inferences: from effects to causes.
  P(Burglary|JohnCalls)
• Causal inferences: from causes to effects.
  P(JohnCalls|Burglary)
  P(MaryCalls|Burglary)
• Intercausal inferences: between causes of a common effect.
  P(Burglary|Alarm)
  P(Burglary|Alarm ∧ Earthquake)
• Mixed inference: combining two or more of the above.
  P(Alarm|JohnCalls ∧ ¬Earthquake)
  P(Burglary|JohnCalls ∧ ¬Earthquake)

Inference Algorithms: Overview

• Exact inference
  – Trees and polytrees: message-passing algorithm
  – Multiply-connected networks: clustering
• Approximate inference
  – Large, complex networks:
    · stochastic simulation
    · other approximation methods
• In the general case, both sorts of inference are computationally complex ("NP-hard").
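For a network this small, any of the queries above can be answered by brute-force enumeration of the full joint (the exact algorithms exist precisely to avoid this exponential enumeration in larger networks). A sketch of the diagnostic query P(Burglary | JohnCalls), again using the Pearl / Russell & Norvig CPT values:

```python
from itertools import product

# Earthquake network CPTs: P(var = True | parents).
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}
p_m = {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    """Full joint probability via the network factorization."""
    def pr(p_true, val):                 # P(var = val) from P(var = True)
        return p_true if val else 1 - p_true
    return (pr(p_b, b) * pr(p_e, e) * pr(p_a[(b, e)], a)
            * pr(p_j[a], j) * pr(p_m[a], m))

# P(Burglary | JohnCalls) = P(B, J) / P(J), marginalizing the rest.
num = sum(joint(True, e, a, True, m)
          for e, a, m in product([True, False], repeat=3))
den = sum(joint(b, e, a, True, m)
          for b, e, a, m in product([True, False], repeat=4))
p_b_given_j = num / den
```

The answer is about 0.016: John calling raises belief in a burglary well above its 0.001 prior, but false alarms and phone confusion keep it small.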


Message Passing Example

[Figure: a polytree version of the example. Burglary and Earthquake are parents of Alarm; PhoneRings and Alarm are parents of JohnCalls; Alarm is the parent of MaryCalls. CPTs: P(B) = 0.001, P(E) = 0.002, P(Ph) = 0.05; P(A|B,E) = 0.95, 0.94, 0.29, 0.001 for (T,T), (T,F), (F,T), (F,F); P(J|Ph,A) = 0.95, 0.5, 0.90, 0.01 for (T,T), (T,F), (F,T), (F,F); P(M|A): 0.70 (T), 0.01 (F). The annotations show the λ and π messages passed between nodes, e.g. π(B) = (.001, .999), λ(B) = (1, 1), bel(B) = (.001, .999); bel(Ph) = (.05, .95); evidence vectors λ(J) = (1, 1), λ(M) = (1, 0).]

Inference in Multiply Connected Networks

• Networks where two nodes are connected by more than one path:
  – two or more possible causes share a common ancestor, or
  – one variable can influence another through more than one causal mechanism.
• Example: the Cancer network.

[Figure: A (Metastatic Cancer) is a parent of B (Increased total serum calcium) and C (Brain tumour); B and C are parents of D (Coma); C is a parent of E (Severe Headaches).]

• Message passing doesn't work: evidence gets "counted twice".

Clustering Methods

• Transform the network into a probabilistically equivalent polytree by merging (clustering) the offending nodes.
• Cancer example: a new node Z combines B and C.

[Figure: A is the parent of Z = B,C; Z is the parent of D and E.]

    P(z|a) = P(b, c|a) = P(b|a) P(c|a)
    P(e|z) = P(e|b, c) = P(e|c)
    P(d|z) = P(d|b, c)

Clustering Methods (cont.)

• The Jensen join-tree version (Jensen, 1996) is currently the most efficient algorithm in this class (e.g. used in Hugin and Netica).
• Network evaluation is done in two stages:
  – Compile into a join-tree
    · may be slow
    · may require too much memory if the original network is highly connected
  – Do belief updating in the join-tree (usually fast)
• Caveat: clustered nodes have increased complexity; updates may be computationally complex.
Approximate Inference with Stochastic Simulation

• Use the network to generate a large number of cases that are consistent with the network distribution.
• Evaluation may not converge to the exact values (in reasonable time).
• Usually converges to close to the exact solution quickly if the evidence is not too unlikely.
• Performs better when evidence is nearer to root nodes; however, in real domains evidence tends to be near the leaves (Nicholson & Jitnah, 1998).

Making Decisions

• Bayesian networks can be extended to support decision making.
• Preferences between different outcomes of various plans: utility theory.
• Decision theory = utility theory + probability theory.
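One simple instance of stochastic simulation (a sketch, not the specific algorithm the slide has in mind) is forward sampling with rejection: sample each node from its CPT in causal order, discard samples inconsistent with the evidence, and estimate the query from the rest. For the earthquake network and the query P(Burglary | JohnCalls):

```python
import random

# Earthquake network CPTs (Pearl / Russell & Norvig values).
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}

def sample(rng):
    """Forward-sample one case in causal (topological) order."""
    b = rng.random() < p_b
    e = rng.random() < p_e
    a = rng.random() < p_a[(b, e)]
    j = rng.random() < p_j[a]
    return b, j

rng = random.Random(0)
kept = burglaries = 0
for _ in range(200_000):
    b, j = sample(rng)
    if j:                      # reject cases inconsistent with evidence J = True
        kept += 1
        burglaries += b
estimate = burglaries / kept   # approximates P(Burglary | JohnCalls)
```

This also illustrates the slide's caveat: the evidence J = True has probability around 0.05, so roughly 95% of the generated cases are thrown away, and rarer evidence would make convergence far slower.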

Decision Networks

A decision network represents information about

• the agent's current state
• its possible actions
• the state that will result from the agent's action
• the utility of that state

Also called influence diagrams (Howard & Matheson, 1981).

Types of Nodes

Chance nodes (ovals): represent random variables (as in Bayesian networks). Each has an associated CPT. Parents can be decision nodes and other chance nodes.

Decision nodes (rectangles): represent points where the decision maker has a choice of actions.

Utility nodes (diamonds): represent the agent's utility function (also called value nodes in the literature). Parents are the variables describing the outcome state that directly affect utility. Each has an associated table representing a multi-attribute utility function.


Example: Umbrella

[Figure: Weather is the parent of Forecast; Weather and the decision node Take Umbrella are parents of the utility node U.]

    P(Weather = Rain) = 0.3

    P(Forecast = Rainy | Weather = Rain) = 0.60
    P(Forecast = Cloudy | Weather = Rain) = 0.25
    P(Forecast = Sunny | Weather = Rain) = 0.15

    P(Forecast = Rainy | Weather = NoRain) = 0.1
    P(Forecast = Cloudy | Weather = NoRain) = 0.2
    P(Forecast = Sunny | Weather = NoRain) = 0.7

    U(NoRain, TakeUmbrella) = 20
    U(NoRain, LeaveAtHome) = 100
    U(Rain, TakeUmbrella) = 70
    U(Rain, LeaveAtHome) = 0

Evaluating Decision Networks: Algorithm

1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
   (a) Set the decision node to that value.
   (b) Calculate the posterior probabilities for the parent nodes of the utility node (as for BNs).
   (c) Calculate the resulting (expected) utility for the action.
3. Return the action with the highest expected utility.

This is simple for a single decision, less so when executing several actions in sequence (i.e. a plan).
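The algorithm above can be run by hand on the umbrella network. A sketch for the evidence Forecast = Rainy (the choice of evidence value is mine; the slide does not fix one):

```python
# Umbrella decision network parameters, from the slide.
p_rain = 0.3
p_forecast = {"Rain":   {"Rainy": 0.60, "Cloudy": 0.25, "Sunny": 0.15},
              "NoRain": {"Rainy": 0.10, "Cloudy": 0.20, "Sunny": 0.70}}
utility = {("Rain", "Take"): 70, ("Rain", "Leave"): 0,
           ("NoRain", "Take"): 20, ("NoRain", "Leave"): 100}

forecast = "Rainy"                    # step 1: set the evidence

# Step 2(b): posterior over Weather given the forecast (Bayes' theorem).
num_rain = p_rain * p_forecast["Rain"][forecast]
num_norain = (1 - p_rain) * p_forecast["NoRain"][forecast]
post_rain = num_rain / (num_rain + num_norain)

# Step 2(c): expected utility of each action under that posterior.
eu = {action: post_rain * utility[("Rain", action)]
              + (1 - post_rain) * utility[("NoRain", action)]
      for action in ("Take", "Leave")}

best = max(eu, key=eu.get)            # step 3: pick the best action
```

A rainy forecast pushes P(Rain) up to 0.72, giving EU(Take) = 56 against EU(Leave) = 28, so the network recommends taking the umbrella.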

Dynamic Belief Networks

[Figure: a chain of nodes State t−2 ... State t+2 (the state evolution model), each state emitting an Obs node (the sensor model).]

• The values of state variables at time t depend only on the values at t−1.
• Can calculate distributions for S_{t+1} and further: probabilistic projection.
• Can be done using standard BN updating algorithms.
• This type of DBN gets very large, very quickly.
• Usually only two time slices of the network are kept.

Dynamic Decision Network

• Similarly, decision networks can be extended to include temporal aspects.
• Sequence of decisions taken = plan.

[Figure: decision nodes D_t, D_{t+1}, D_{t+2} feeding a chain of State t ... State t+3 nodes with observations Obs t ... Obs t+3 and a utility node U_{t+3}.]


Uses of Bayesian Networks

1. Calculating the belief in query variables given values for evidence variables (above).
2. Predicting values of dependent variables given values for independent variables.
3. Decision making based on the probabilities in the network and on the agent's utilities (influence diagrams [Howard and Matheson 1981]).
4. Deciding which additional evidence should be observed in order to gain useful information.
5. Sensitivity analysis to test the impact of changes in probabilities or utilities on decisions.

Bayesian Networks: Summary

• Bayes' rule allows unknown probabilities to be computed from known ones.
• Conditional independence (due to causal relationships) allows efficient updating.
• Bayesian networks are a natural way to represent conditional independence info.
  – Links between nodes: qualitative aspects;
  – conditional probability tables: quantitative aspects.
• Inference means computing the probability distribution for a set of query variables, given a set of evidence variables.
• Inference in Bayesian networks is very flexible: evidence can be entered about any node and beliefs updated in any other nodes.
• The speed of inference in practice depends on the structure of the network: how many loops; the numbers of parents; the location of the evidence and query nodes.
• Bayesian networks can be extended with decision nodes and utility nodes to support decision making: decision networks or influence diagrams.
• Bayesian and decision networks can be extended to allow explicit reasoning about changes over time.

Applications: Overview

• (Simple) Example Networks
• Applications
  – Medical Decision Making: survey of applications
  – Planning and Plan Recognition
  – Natural Language Generation (NAG)
  – Bayesian poker
• Deployed Bayesian Networks (see handout for details)
• BN Software
• Web Resources


Example: Cancer

Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from (Pearl, 1988).)

[Figure: A (Metastatic Cancer) is a parent of B (Increased total serum calcium) and C (Brain tumour); B and C are parents of D (Coma); C is a parent of E (Severe Headaches).]

    P(a) = 0.2
    P(b|a) = 0.80      P(b|¬a) = 0.20
    P(c|a) = 0.20      P(c|¬a) = 0.05
    P(d|b,c) = 0.80    P(d|¬b,c) = 0.80
    P(d|b,¬c) = 0.80   P(d|¬b,¬c) = 0.05
    P(e|c) = 0.80      P(e|¬c) = 0.60

Example: Asia

A patient presents to a doctor with shortness of breath. The doctor considers that the possible causes are tuberculosis, lung cancer and bronchitis. Additional relevant information includes whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from (Lauritzen, 1988).)

[Figure: "visit to Asia" is a parent of tuberculosis; smoking is a parent of lung cancer and bronchitis; tuberculosis and lung cancer are parents of "either tub or lung cancer", which is a parent of "positive X-ray" and dyspnoea; bronchitis is also a parent of dyspnoea.]

Example: A Lecturer's Life

Dr. Ann Nicholson spends 60% of her work time in her office. The rest of her work time is spent elsewhere. When Ann is in her office, half the time her light is off (when she is trying to hide from students and get some real work done). When she is not in her office, she leaves her light on only 5% of the time. 80% of the time she is in her office, Ann is logged onto the computer. Because she sometimes logs onto the computer from home, 10% of the time she is not in her office, she is still logged onto the computer. Suppose a student checks Dr. Nicholson's login status and sees that she is logged on. What effect does this have on the student's belief that Dr. Nicholson's light is on? (Example from (Nicholson, 1999))

[Figure: in-office is the parent of lights-on and logged-on.]

Probabilistic Reasoning in Medicine

• See handout from (Dean et al., 1993).
• Simplest tree-structured network for diagnostic reasoning:
  – H = disease hypothesis; F = findings (symptoms, test results)
• Multiply-connected network (QMR structure):
  – B = background information (e.g. age, sex of patient)
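The lecturer's-life query is a small exercise in updating through a common cause, and can be worked directly from the story's numbers:

```python
# In-office (O) is the common cause of lights-on (L) and logged-on (C).
p_o = 0.6
p_l = {True: 0.5, False: 0.05}   # P(lights-on | O)
p_c = {True: 0.8, False: 0.10}   # P(logged-on | O)

# Prior belief that the light is on:
prior_l = p_l[True] * p_o + p_l[False] * (1 - p_o)

# Evidence logged-on = True: update O by Bayes' theorem...
post_o = (p_c[True] * p_o) / (p_c[True] * p_o + p_c[False] * (1 - p_o))

# ...then propagate the updated belief down to L.
post_l = p_l[True] * post_o + p_l[False] * (1 - post_o)
```

Seeing that Ann is logged on raises the belief that she is in her office from 0.6 to about 0.92, and hence the belief that her light is on from 0.32 to about 0.47.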


Medical Applications

• Pathfinder case study: see handout using material from (Russell & Norvig, 1995, pp. 457-458).
• QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Dean & Wellman, 1991).
• MUNIN (Andreassen et al., 1989): neuromuscular disorders, about 1000 nodes; exact computation < 5 seconds.
• Glucose prediction and insulin dose adjustment (a DBN application) (Andreassen et al., 1991).
• CPSC project (Pradham et al., 1994):
  – 448 nodes, 906 links, 8254 conditional probability values
  – LW algorithm: answers in 35 mins (1994)

• Application of LW to medical diagnosis (Shwe & Cooper, 1990).
• Forecasting sleep apnea (Dagum et al., 1993).
• ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See Netica examples.)

[Figure: the ALARM network, 37 nodes (MinVolSet, VentMach, Disconnect, PulmEmbolus, Intubation, VentTube, KinkedTube, ..., BP), annotated with arc strengths.]

Robot Navigation and Tracking

• An example of a Dynamic Decision Network.
• Dean & Wellman, 1991.


Nicholson & Korb 57 Nicholson & Korb 58

Plan Recognition Applications Traffic Monitoring:


BATmobile
 Keyhole plan recognition in an Adventure
game (Albrecht et al., 1998).
A A A A A A A A
 (Forbes et al., 1995)
0 1 2 3 0 1 2 3

 Example of a DBN
Q Q’ Q Q’

L L L L L L L L
0 1 2 3 0 1 2 3

(a) mainModel (b) indepModel

A A A A
0 1 2 3 Q Q’

Q Q’ L L L L
0 1 2 3

(c) actionModel (d) locationModel

 Traffic plan recognition (Pynadeth&Wellman,


1995).

Bayesian AI Tutorial Bayesian AI Tutorial

Natural Language Generation

NAG (McConachy et al., 1999), A Nice Argument Generator, uses two Bayesian networks to generate and assess natural language arguments:

Normative Model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.

User Model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).

The BNs are embedded in a semantic hierarchy, which

• supports attentional modeling
• supports constrained updating

[Figure: a two-layer semantic network sitting above a Bayesian network. Higher-level concepts like 'motivation' and 'ability' connect to lower-level concepts like 'Grade Point Average', which connect to propositions such as [publications authored by person X cited >5 times].]


Bayesian Poker

• (Korb et al., 1999)
• Poker is ideal for testing automated reasoning under uncertainty:
  – physical randomisation
  – incomplete hand information
  – incomplete opponent info (strategies, bluffing, etc.)
• Bayesian networks are a good representation for complex game playing.
• Our Bayesian Poker Player (BPP) plays 5-card stud poker at the level of a good amateur human player. To play:

    telnet indy13.cs.monash.edu.au
    login: poker
    password: maverick

Bayesian Poker BN

• The Bayesian network provides an estimate of winning at any point in the hand.
• Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).

[Figure: BPP Win depends on OPP Final and BPP Final; matrices M_{C|F} relate final to current hand types (OPP Current, BPP Current); M_{A|C} and M_{U|C} relate the opponent's current hand type to the observation nodes OPP Action and OPP Upcards.]

Bayesian Poker BN (cont.)

Hand Types

• The initial 9 hand types are too coarse.
• We use a finer granularity for the most common hands (busted and a pair):
  – low, medium, Q-high, K-high, A-high
  – results in 17 hand types

• Different networks (matrices) are used for each round.
• OPP Current, BPP Current: (partial) hand types with the cards dealt so far.
• OPP Final, BPP Final: hand types after all 5 cards are dealt.
• Observation nodes:
  – OPP Upcards: all the opponent's cards except the first are visible to BPP.
  – OPP Action: BPP knows the opponent's action.

Conditional Probability Matrices

• M_{A|C}: the probability of the opponent's action given current hand type, learned from observed showdown data.
• M_{U|C} and M_{C|F}: estimated by dealing out 10^7 poker hands.

Belief Updating: since the network is a polytree, a simple, fast propagation updating algorithm is used.


Nicholson & Korb 65 Nicholson & Korb 66

Current Status, Possible Deployed BNs


Extensions
 From Web Site database: See handout for
 BPP outperforms automated opponents, is details.
fairly even with ave amateur humans, and
loses to experienced humans.
 TRACS: Predicting reliability of military
vehicles.
 Learning the OPP Action CPTs does not (yet)
 Andes: intelligent tutoring system for
appear to improve performance.
physics.
 BN Improvements
 Distributed Virtual Agents advising online
– Refine action nodes users on web sites.
– Further refinement of hand types
 Information extraction from natural
– Improve network structure language text
– Adding bluffing to the opponent model
 DXPLAIN: decision support for medical
– Improved learning of opponent model diagnosis.
 More complex poker: multi-opponent games,  Illiad: teaching tool for medical students.
table stake games.
 Microsoft Health Produce: “find by symptom”
 DBN model to represent changes over time
feature.

Bayesian AI Tutorial Bayesian AI Tutorial

• Weapons scheduling.
• Monitoring power generation.
• Processor fault diagnosis.
• Knowledge Industries applications: (a) in medicine: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, and home-based health evaluations; (b) in capital equipment: locomotives, gas-turbine engines for aircraft and land-based power production, the space shuttle, and office equipment.
• Software debugging.
• Vista: decision support system used at NASA Mission Control Center.
• MS: (a) Answer Wizard (Office 95), information retrieval; (b) Print Troubleshooter; (c) Aladdin, troubleshooting customer support.

BN Software: Issues

• Functionality
  – especially application vs API
• Price
  – many free demo versions, or free for educational use
  – commercial licence costs
• Availability (platforms)
• Quality
  – GUI
  – documentation and help
• Leading edge
• Robustness
  – software
  – company


BN Software

• Analytica: www.lumina.com
• Hugin: www.hugin.com
• Netica: www.norsys.com

The above 3 are available during the tutorial lab session.

• JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
• Many other packages (see next slide)

Web Resources

• Bayesian Belief Network site (Russell Greiner):
  www.cs.ualberta.ca/~greiner/bn.html
• Bayesian Network Repository (Nir Friedman):
  www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
• Summary of BN software and links to software sites (Kevin Murphy):
  http.cs.berkeley.edu/~murphyk/Bayes/bnsoft.html

Applications: Summary

• Various BN structures are available to compactly and accurately represent certain types of domain features.
• Bayesian networks have been used for a wide range of AI applications.
• Robust and easy-to-use Bayesian network software is now readily available.

Learning Bayesian Networks

• Linear and Discrete Models
• Learning Network Parameters
  – Linear Coefficients
  – Learning Probability Tables
• Learning Causal Structure
• Conditional Independence Learning
  – Statistical Equivalence
  – TETRAD II
• Bayesian Learning of Bayesian Networks
  – Cooper & Herskovits: K2
  – Learning Variable Order
  – Statistical Equivalence Learners
• Full Causal Learners
• Minimum Encoding Methods
  – Lam & Bacchus's MDL learner
  – MML metrics
  – MML search algorithms
  – MML Sampling
• Empirical Results

Linear and Discrete Models

Linear models: used in biology and the social sciences since Sewall Wright (1921). Linear models represent causal relationships as sets of linear functions of "independent" variables.

[Figure: X1 and X2 are parents of X3.]

Equivalently (assuming linear parameters):

    X3 = a13 X1 + a23 X2 + ε3

Discrete models: "Bayesian nets" replace the vectors of linear coefficients with CPTs.

Learning Linear Parameters

Maximum likelihood methods have been available since Wright's path model analysis (1921). Equivalent methods:

• Simon-Blalock method (Simon, 1954; Blalock, 1964)
• Ordinary least squares multiple regression (OLS)
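A minimal OLS sketch for the model above, solving the normal equations by hand for the two path coefficients. The data here are synthetic and noise-free (my own illustrative numbers), so the true coefficients are recovered exactly; with noisy data the same equations give the maximum likelihood estimates:

```python
# Recover a13, a23 in  X3 = a13*X1 + a23*X2 + e  (no intercept).
x1 = [0.5, -1.2, 2.0, 0.3, -0.7, 1.5]
x2 = [1.0, 0.4, -0.6, 2.2, -1.1, 0.8]
true_a13, true_a23 = 2.0, 3.0
x3 = [true_a13 * u + true_a23 * v for u, v in zip(x1, x2)]  # noise omitted

# Sums of squares and cross-products for the normal equations.
s11 = sum(u * u for u in x1)
s22 = sum(v * v for v in x2)
s12 = sum(u * v for u, v in zip(x1, x2))
s13 = sum(u * w for u, w in zip(x1, x3))
s23 = sum(v * w for v, w in zip(x2, x3))

# Solve the 2x2 system  [s11 s12; s12 s22] [a13; a23] = [s13; s23].
det = s11 * s22 - s12 * s12
a13 = (s13 * s22 - s23 * s12) / det
a23 = (s23 * s11 - s13 * s12) / det
```

This is exactly the multiple-regression computation referred to in the slide, reduced to its two-predictor special case.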

Learning Conditional Probability Tables

Spiegelhalter & Lauritzen (1990):

• assume parameter independence
• each CPT cell i corresponds to a parameter in a Dirichlet distribution

    D[α1, ..., αi, ..., αK]

  over the node's K values (one such distribution per parent instantiation)
• the probability of outcome i is αi / Σ_{k=1..K} αk
• observing outcome i updates D to D[α1, ..., αi + 1, ..., αK]

Others are looking at learning without parameter independence. E.g.:

• Dual log-linear and full CPT models (Neil, Wallace, Korb 1999).
• Decision trees to learn structure within CPTs (Boutilier et al. 1996).
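The Dirichlet update rule above amounts to simple counting. A sketch for a Boolean node (one parent instantiation) with a uniform D[1, 1] prior:

```python
# Dirichlet parameters for the node's two values (True, False).
alphas = [1.0, 1.0]                 # uniform prior D[1, 1]

def prob(alphas, i):
    """Current estimate of P(outcome i): alpha_i / sum_k alpha_k."""
    return alphas[i] / sum(alphas)

def observe(alphas, i):
    """Seeing outcome i takes D[.., alpha_i, ..] to D[.., alpha_i + 1, ..]."""
    alphas[i] += 1.0

p_before = prob(alphas, 0)          # 0.5 under the uniform prior
for outcome in [0, 0, 0, 1]:        # three observations of True, one of False
    observe(alphas, outcome)
p_after = prob(alphas, 0)           # (1 + 3) / (2 + 4) = 2/3
```

The posterior estimate moves from 0.5 toward the observed frequency 3/4, tempered by the prior counts.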


Nicholson & Korb 77 Nicholson & Korb 78

Learning Causal Structure

This is the real problem; parameterizing models is essentially numerical computing.

There are two basic methods:

• Learning from conditional independencies (CI learning)

• Learning using a scoring metric (Metric learning)

CI learning (Verma and Pearl, 1991)

Suppose you have an Oracle who can answer yes or no to any question of the type:

    X ⊥ Y | S ?

Then you can learn the correct causal model, up to statistical equivalence.

Statistical Equivalence

Verma and Pearl's rules identify the set of causal models which are statistically equivalent —

    Two causal models H1 and H2 are statistically equivalent iff they contain the same variables and joint samples over them provide no statistical grounds for preferring one over the other.

Examples

• All fully connected models are equivalent.
• A → B → C and A ← B ← C.
• A → B → D ← C and A ← B → D ← C.
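The equivalence examples above can be checked mechanically with the Verma & Pearl graphical criterion (same skeleton, same v-structures); a sketch with dags represented as parent-set dictionaries (a hypothetical encoding):

```python
def skeleton(dag):
    # Undirected adjacencies of a dag given as {child: set_of_parents}.
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    # Colliders a -> c <- b where a and b are not themselves adjacent.
    skel = skeleton(dag)
    vs = set()
    for c, ps in dag.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, c, b))
    return vs

def equivalent(d1, d2):
    # Same skeleton and same v-structures => statistically equivalent.
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

# The chain A -> B -> C is equivalent to A <- B <- C ...
chain    = {"A": set(), "B": {"A"}, "C": {"B"}}
rchain   = {"A": {"B"}, "B": {"C"}, "C": set()}
# ... but not to the collider A -> B <- C.
collider = {"A": set(), "B": {"A", "C"}, "C": set()}

print(equivalent(chain, rchain))    # True
print(equivalent(chain, collider))  # False
```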


Statistical Equivalence

Chickering (1995):

• Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.

• If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

      max_θ1 P(e | H1, θ1) = max_θ2 P(e | H2, θ2)

  where θi is a parameterization of Hi.

TETRAD II

— Spirtes, Glymour and Scheines (1993)

Replace the Oracle with statistical tests:

• for linear models, a significance test on partial correlation:

      X ⊥ Y | S  iff  ρ_XY·S = 0

• for discrete models, a χ² test on the difference between CPT counts expected with independence (Ei) and observed (Oi):

      X ⊥ Y | S  iff  Σ_i O_i ln(O_i / E_i) ≈ 0
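The discrete-case quantity above is (up to a factor of 2) the G² statistic, 2 Σ_i O_i ln(O_i/E_i); a sketch for unconditional X ⊥ Y from a two-way table of counts (the tables are hypothetical, and in TETRAD-style use the test is applied within each stratum of the conditioning set S):

```python
import numpy as np

def g_squared(table):
    """G² = 2 Σ O ln(O/E) for a two-way contingency table of observed counts."""
    O = np.asarray(table, dtype=float)
    row = O.sum(axis=1, keepdims=True)
    col = O.sum(axis=0, keepdims=True)
    E = row * col / O.sum()      # counts expected under independence
    mask = O > 0                 # 0·ln(0) is taken as 0
    return 2.0 * np.sum(O[mask] * np.log(O[mask] / E[mask]))

# Hypothetical joint counts for (X, Y): exactly independent vs strongly dependent.
independent = [[250, 250], [250, 250]]
dependent   = [[400, 100], [100, 400]]

print(g_squared(independent))  # 0.0: no evidence against X ⊥ Y
print(g_squared(dependent))    # large: reject independence
```

Asymptotically the statistic is χ²-distributed under independence, so it can be referred to a χ² table for a significance level.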


TETRAD II

• Asymptotically finds causal structure to within the statistical equivalence class of the true model.

• Requires larger sample sizes than MML (Dai, Korb, Wallace & Wu, 1997):

      Statistical tests are not robust given weak causal interactions and/or small samples.

• Cheap, and easy to use.

Bayesian LBN: Cooper & Herskovits

— Cooper & Herskovits (1991, 1992)

Compute P(hi | e) by brute force, under the assumptions:

1. All variables are discrete.
2. Samples are i.i.d.
3. No missing values.
4. All values of child variables are uniformly distributed.
5. Priors over hypotheses are uniform.

With these assumptions, Cooper & Herskovits reduce the computation of P_CH(h, e) to a polynomial-time counting problem.


Cooper & Herskovits

But the hypothesis space is exponential; they go for dramatic simplification:

6. Assume we know the temporal ordering of the variables.

Now for any pair of variables:

• either they are connected by an arc
• or they are not.

Further, cycles are impossible.

New hypothesis space has size only 2^((n² − n)/2) (still exponential).

Algorithm "K2" does a greedy search through this reduced space.

Learning Variable Order

Reliance upon a given variable order is a major drawback to K2

    And many other algorithms (Buntine 1991, Bouckert 1994, Suzuki 1996, Madigan & Raftery 1994)

What's wrong with that?

• We want autonomous AI (data mining). If experts can order the variables they can likely supply models.

• Determining variable ordering is half the problem. If we know A comes before B, the only remaining issue is whether there is a link between the two.

• The number of orderings consistent with dags is apparently exponential (Brightwell & Winkler 1990). So iterating over all possible orderings will not scale up.
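K2's greedy search through the order-restricted space can be sketched as below; `toy_score` is a hypothetical stand-in for the Cooper & Herskovits metric, which in the real algorithm is computed from the data counts:

```python
def k2(order, score, max_parents=2):
    """K2-style greedy search: variables come in a fixed (given) temporal
    order; for each variable, repeatedly add the predecessor that most
    improves its parent-set score, stopping when no addition helps."""
    parents = {v: set() for v in order}
    for i, v in enumerate(order):
        candidates = set(order[:i])      # only predecessors: no cycles possible
        best = score(v, parents[v])
        improved = True
        while improved and candidates and len(parents[v]) < max_parents:
            improved = False
            gains = {c: score(v, parents[v] | {c}) for c in candidates}
            c, s = max(gains.items(), key=lambda kv: kv[1])
            if s > best:
                parents[v].add(c)
                candidates.remove(c)
                best, improved = s, True
    return parents

# Hypothetical score rewarding recovery of the known structure A -> C <- B.
true_parents = {"A": set(), "B": set(), "C": {"A", "B"}}
def toy_score(v, ps):
    return len(ps & true_parents[v]) - 2 * len(ps - true_parents[v])

learned = k2(["A", "B", "C"], toy_score)
print(learned)
```

Note how the given ordering does half the work: candidate parents are restricted to predecessors, so acyclicity never needs checking.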


Statistical Equivalence Learners

Heckerman & Geiger (1995) advocate learning only up to statistical equivalence classes (à la TETRAD II).

    Since observational data cannot distinguish btw equivalent models, there's no point trying to go further.

⇒ Madigan, Andersson, Perlman & Volinsky (1996) follow this advice, use uniform prior over equivalence classes.

⇒ Geiger and Heckerman (1994) define Bayesian metrics for linear and discrete equivalence classes of models (BGe and BDe).

Statistical Equivalence Learners

Wallace & Korb (1999): This is not right!

• These are causal models; they are distinguishable on experimental data.
  – Failure to collect some data is no reason to change prior probabilities. E.g., if your thermometer topped out at 35°, you wouldn't treat ≥ 35° and 34° as equally likely.

• Not all equivalence classes are created equal:

      {A ← B → C, A → B → C, A ← B ← C}
      {A → B ← C}

• Within classes some dags should have greater priors than others. E.g.,

      LightsOn → InOffice → LoggedOn  v.
      LightsOn ← InOffice → LoggedOn


Full Causal Learners

So. . . a full causal learner is an algorithm that:

1. Learns causal connectedness.
2. Learns v-structures.
   Hence, learns equivalence classes.
3. Learns full variable order.
   Hence, learns full causal structure (order + connectedness).

• TETRAD II: 1, 2.
• Madigan et al.: 1, 2.
• Cooper & Herskovits' K2: 1.
• Lam and Bacchus MDL: 1, 2 (partial), 3 (partial).
• Wallace, Neil, Korb MML: 1, 2, 3.

MDL

Minimum Description Length (MDL) inference —

• Invented by Rissanen (1978), based upon Minimum Message Length (MML) invented by Wallace (Wallace and Boulton, 1968).

• Plays trade-off btw
  – model simplicity
  – model fit to the data
  by minimizing the length of a joint description of model and data given the model.


Lam & Bacchus (1993)

MDL encoding of causal models:

• Network:

      Σ_{i=1}^n [ k_i log(n) + d(s_i − 1) Π_{j∈π(i)} s_j ]

  – k_i log(n) for specifying k_i parents for the ith node
  – d(s_i − 1) Π_{j=1}^{k_i} s_j for specifying the CPT:
      d is the fixed bit-length per probability
      s_i is the number of states for node i

• Data given network:

      N Σ_{i=1}^n H(X_i) − N Σ_{i=1}^n M(X_i; π(i))

  – M(X_i; π(i)) is mutual information btw X_i and its parent set
  – H(X_i) is entropy of variable X_i

(NB: This code is not efficient. E.g., treats every node as equally likely to be a parent; assumes knowledge of all k_i.)

Lam & Bacchus

Search algorithm:

• Initial constraints taken from domain expert: partial variable order, direct connections

• Greedy search: every possible arc addition is tested, best MDL measure used to add one (Note: no arcs are deleted)

• Local arcs checked for improved MDL via arc reversal

• Iterate until MDL fails to improve

⇒ Results similar to K2, but without full variable ordering
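The network part of the encoding above can be computed directly; a sketch, with the per-probability bit length d and the example network both hypothetical:

```python
import math

def network_description_length(n_states, parents, d=10):
    """Lam & Bacchus style network encoding length in bits:
    sum_i [ k_i log n + d (s_i - 1) prod_{j in parents(i)} s_j ],
    where s_i is the number of states of node i and k_i = |parents(i)|."""
    n = len(n_states)
    total = 0.0
    for i, ps in parents.items():
        k_i = len(ps)
        cpt_cells = 1
        for j in ps:
            cpt_cells *= n_states[j]   # parent configurations for node i
        total += k_i * math.log2(n) + d * (n_states[i] - 1) * cpt_cells
    return total

# Hypothetical 3-node network A -> C <- B, all variables binary.
n_states = {"A": 2, "B": 2, "C": 2}
parents = {"A": [], "B": [], "C": ["A", "B"]}
print(network_description_length(n_states, parents))  # ≈ 63.17 bits
```

A denser network pays both in the k_i log n parent lists and in the exponentially growing CPT term, which is exactly the simplicity/fit trade-off MDL plays against the data term.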


MML

Minimum Message Length (Wallace & Boulton 1968) uses Shannon's measure of information:

    I(m) = −log P(m)

Applied in reverse, we can compute P(h, e) from I(h, e).

Given an efficient joint encoding method for the hypothesis & evidence space (i.e., satisfying Shannon's law), MML:

    Searches {h_i} for that hypothesis h that minimizes I(h) + I(e|h).

Equivalent to that h that maximizes P(h)P(e|h) — i.e., P(h|e).

The other significant difference from MDL: MML takes parameter estimation seriously.

MML Metric for Linear Models

• Network:

      log n! + n(n−1)/2 − log E

  – log n! for variable order
  – n(n−1)/2 for connectivity
  – −log E restores efficiency by subtracting the cost of selecting a linear extension

• Parameters given dag h:

      −Σ_j log [ f(θ_j | h) / √F(θ_j) ]

  where θ_j are the parameters for X_j and F(θ_j) is the Fisher information. f(θ_j | h) is assumed to be N(0, σ_j).

  (Cf. with MDL's fixed length for parms)


MML Metric for Linear Models

• Sample for X_j given h and θ_j:

      −log P(e | h, θ_j) = −Σ_{k=1}^K log [ (1/√(2π σ_j²)) e^(−Δ_jk² / 2σ_j²) ]

  where K is the number of sample values and Δ_jk is the difference between the observed value of X_j and its linear prediction.

MML Metric for discrete models

We can use P_CH(h_i, e) (from Cooper & Herskovits) to define an MML metric for discrete models.

Difference between MML and Bayesian metrics:

    MML partitions the parameter space and selects optimal parameters.

Equivalent to a penalty of (1/2) log(πe/6) per parameter (Wallace & Freeman 1987); hence:

    I(e, h_i) = Σ_j (s_j/2) log(πe/6) − log P_CH(h_i, e)    (1)

Applied in MML Sampling algorithm.


MML search algorithms

MML metrics need to be combined with search. This has been done three ways:

1. Wallace, Korb, Dai (1996): greedy search (linear).
   • Brute force computation of linear extensions (small models only).

2. Neil and Korb (1999): genetic algorithms (linear).
   • Asymptotic estimator of linear extensions
   • GA chromosomes = causal models
   • Genetic operators manipulate them
   • Selection pressure is based on MML

3. Wallace and Korb (1999): MML sampling (linear, discrete).
   • Stochastic sampling through space of totally ordered causal models
   • No counting of linear extensions required

MML Sampling

Search space of totally ordered models (TOMs).

Sampled via a Metropolis algorithm (Metropolis et al. 1953).

From current model M, find the next model M′ by:

• Randomly select a variable; attempt to swap order with its predecessor.

• Or, randomly select a pair; attempt to add/delete an arc.

Attempts succeed whenever P(M′)/P(M) > U (per MML metric), where U is uniformly random from [0, 1].
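The acceptance rule above is a standard Metropolis step; a sketch over a toy one-dimensional model space, with a hypothetical unnormalized posterior standing in for the MML metric:

```python
import random

def metropolis_step(m, propose, posterior, rng=random):
    """One Metropolis move: accept proposal m' whenever
    posterior(m') / posterior(m) > U, with U uniform on [0, 1]."""
    m_new = propose(m)
    if posterior(m_new) / posterior(m) > rng.random():
        return m_new
    return m

# Toy integer-valued model space with a hypothetical unnormalized posterior.
def posterior(x):
    return 2.0 ** (-abs(x))        # peaked at x = 0

def propose(x):
    return x + random.choice([-1, 1])

random.seed(1)
x, visits = 5, {}
for _ in range(20000):
    x = metropolis_step(x, propose, posterior)
    visits[x] = visits.get(x, 0) + 1

# In the long run, states are visited in proportion to their posterior:
print(visits[0] > visits[2] > visits[4])
```

Since P(M′)/P(M) > U is accepted with probability min(1, P(M′)/P(M)), this is the usual Metropolis acceptance rule, and the chain's visit frequencies estimate the posterior, just as the next slide describes for TOMs.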


MML Sampling

Metropolis: this procedure samples TOMs with a frequency proportional to their posterior probability.

To find the posterior of dag h: keep count of visits to all TOMs consistent with h

    Estimated by counting visits to all TOMs with identical max likelihoods to h

Output: Probabilities of

• Top dags
• Top statistical equivalence classes
• Top MML equivalence classes

Empirical Results

A weakness in this area — and AI generally.

• Paper publications based upon very small models, loose comparisons.

• ALARM net often used — everything gets it to within 1 or 2 arcs.

Neil and Korb (1999) compared MML and BGe (Heckerman & Geiger's Bayesian metric over equivalence classes), using identical GA search over linear models:

• On KL distance and topological distance from the true model, MML and BGe performed nearly the same.

• On test prediction accuracy on strict effect nodes (those with no children), MML clearly outperformed BGe.


Current Research Issues

• size and complexity
• difficulties with elicitation
• combinations of discrete and continuous (i.e. mixing node types)
• Learning issues
  – Missing data
  – Latent variables
  – Experimental data
  – Learning CPT structure
  – Multi-structure models
    · continuous & discrete
    · CPTs w/ & w/o parm independence

(Other) Limitations

• inappropriate problems (deterministic systems, legal rules)


References

Introduction to Bayesian AI

T. Bayes (1764) "An Essay Towards Solving a Problem in the Doctrine of Chances." Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.

B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.

B. de Finetti (1964) "Foresight: Its Logical Laws, Its Subjective Sources," in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.

D. Heckerman (1986) "Probabilistic Interpretations for MYCIN's Certainty Factors," in L.N. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence. North-Holland.

C. Howson and P. Urbach (1993) Scientific Reasoning: The Bayesian Approach. Open Court.
A MODERN REVIEW OF BAYESIAN THEORY.

K.B. Korb (1995) "Inductive learning and defeasible inference," Jrn for Experimental and Theoretical AI, 7, 291-324.

R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley.
CHAPTERS 1, 2 AND 4 COVER SOME OF THE RELEVANT HISTORY.

J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

F.P. Ramsey (1931) "Truth and Probability" in The Foundations of Mathematics and Other Essays. NY: Humanities Press.
THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.

R. Reiter (1980) "A logic for default reasoning," Artificial Intelligence, 13, 81-132.

J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ.
STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks

E. Charniak (1991) "Bayesian Networks Without Tears", Artificial Intelligence Magazine, pp. 50-63, Vol 12.
AN ELEMENTARY INTRODUCTION.

D. D'Ambrosio (1999) "Inference in Bayesian Networks". Artificial Intelligence Magazine, Vol 20, No. 2.

P. Haddaway (1999) "An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques". Artificial Intelligence Magazine, Vol 20, No. 2.

Howard & Matheson (1981) "Influence Diagrams," Principles and Applications of Decision Analysis.

F.V. Jensen (1996) An Introduction to Bayesian Networks, Springer.

R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley.
SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.

J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS TO THE AI COMMUNITY.

Poole, D., Mackworth, A., and Goebel, R. (1998) Computational Intelligence: a logical approach. Oxford University Press.

Russell & Norvig (1995) Artificial Intelligence: A Modern Approach, Prentice Hall.

Applications

D.W. Albrecht, I. Zukerman and A.E. Nicholson (1998) "Bayesian Models for Keyhole Plan Recognition in an Adventure Game". User Modeling and User-Adapted Interaction, 8(1-2), 5-47, Kluwer Academic Publishers.

S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) "MUNIN — An Expert EMG Assistant", Computer-Aided Electromyography and Expert Systems, Chapter 21, J.E. Desmedt (Ed.), Elsevier.

S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) "A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study".

I. Beinlich, H. Suermondt, R. Chavez and G. Cooper (1992) "The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks", Proc. of the 2nd European Conf. on Artificial Intelligence in Medicine, pp. 689-693.

T.L. Dean and M.P. Wellman (1991) Planning and Control, Morgan Kaufmann.

T.L. Dean, J. Allen and J. Aloimonos (1994) Artificial Intelligence: Theory and Practice, Benjamin/Cummings.

P. Dagum, A. Galper and E. Horvitz (1992) "Dynamic Network Models for Forecasting", Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pp. 41-48.

J. Forbes, T. Huang, K. Kanazawa and S. Russell (1995) "The BATmobile: Towards a Bayesian Automated Taxi", Proceedings of the 14th Int. Joint Conf. on Artificial Intelligence (IJCAI'95), pp. 1878-1885.

S.L. Lauritzen and D.J. Spiegelhalter (1988) "Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems", Journal of the Royal Statistical Society, 50(2), pp. 157-224.

A.E. Nicholson (1999) "CSE2309/3309 Artificial Intelligence, Monash University, Lecture Notes", http://www.csse.monash.edu.au/~annn/2-3309.html.

M. Pradham, G. Provan, B. Middleton and M. Henrion (1994) "Knowledge engineering for large belief networks", Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence.

D. Pynadeth and M.P. Wellman (1995) "Accounting for Context in Plan Recognition, with Application to Traffic Monitoring", Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 472-481.

M. Shwe and G. Cooper (1990) "An Empirical Analysis of Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network", Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence, pp. 498-508.

L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman and B.G. Taal (1999) "How to Elicit Many Probabilities", Laskey & Prade (eds) UAI99, 647-654.

Zukerman, I., McConachy, R., Korb, K. and Pickett, D. (1999) "Exploratory Interaction with a Bayesian Argumentation System," in IJCAI-99 Proceedings – the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1294-1299, Stockholm, Sweden, Morgan Kaufmann.

McConachy et al (1999)

Learning Bayesian Networks

H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.

R. Bouckeart (1994) Probabilistic network construction using the minimum description length principle. Technical Report RUU-CS-94-27, Dept of Computer Science, Utrecht University.

C. Boutillier, N. Friedman, M. Goldszmidt, D. Koller (1996) "Context-specific independence in Bayesian networks," in Horvitz & Jensen (eds.) UAI 1996, 115-123.

G. Brightwell and P. Winkler (1990) Counting linear extensions is #P-complete. Technical Report DIMACS 90-49, Dept of Computer Science, Rutgers Univ.

W. Buntine (1991) "Theory refinement on Bayesian networks," in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 52-69.

W. Buntine (1996) "A Guide to the Literature on Learning Probabilistic Networks from Data," IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.

D.M. Chickering (1995) "A Transformational Characterization of Equivalent Bayesian Network Structures," in P. Besnard and S. Hanks (eds.) Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 87-98). San Francisco: Morgan Kaufmann.
STATISTICAL EQUIVALENCE.

G.F. Cooper and E. Herskovits (1991) "A Bayesian Method for Constructing Bayesian Belief Networks from Databases," in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 86-94.

G.F. Cooper and E. Herskovits (1992) "A Bayesian Method for the Induction of Probabilistic Networks from Data," Machine Learning, 9, 309-347.
AN EARLY BAYESIAN CAUSAL DISCOVERY METHOD.

H. Dai, K.B. Korb, C.S. Wallace and X. Wu (1997) "A study of causal discovery with weak links and small samples." Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1304-1309. Morgan Kaufmann.

N. Friedman (1997) "The Bayesian Structural EM Algorithm," in D. Geiger and P.P. Shenoy (eds.) Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (pp. 129-138). San Francisco: Morgan Kaufmann.

Geiger and Heckerman (1994) "Learning Gaussian networks," in Lopez de Mantaras and Poole (eds.) UAI 1994, 235-243.

D. Heckerman and D. Geiger (1995) "Learning Bayesian networks: A unification for discrete and Gaussian domains," in Besnard and Hanks (eds.) UAI 1995, 274-284.

D. Heckerman, D. Geiger, and D.M. Chickering (1995) "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, 20, 197-243.
BAYESIAN LEARNING OF STATISTICAL EQUIVALENCE CLASSES.

K. Korb (1999) "Probabilistic Causal Structure" in H. Sankey (ed.) Causation and Laws of Nature: Australasian Studies in History and Philosophy of Science 14. Kluwer Academic.
INTRODUCTION TO THE RELEVANT PHILOSOPHY OF CAUSATION FOR LEARNING BAYESIAN NETWORKS.

P. Krause (1998) Learning Probabilistic Networks. http://www.auai.org/bayesUSKrause.ps.gz
BASIC INTRODUCTION TO BNS, PARAMETERIZATION AND LEARNING CAUSAL STRUCTURE.

W. Lam and F. Bacchus (1993) "Learning Bayesian belief networks: An approach based on the MDL principle," Jrn Comp Intelligence, 10, 269-293.

D. Madigan, S.A. Andersson, M.D. Perlman & C.T. Volinsky (1996) "Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs," Comm in Statistics: Theory and Methods, 25, 2493-2519.

D. Madigan and A.E. Raftery (1994) "Model selection and accounting for model uncertainty in graphical models using Occam's window," Jrn Amer Stat Assoc, 89, 1535-1546.

N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) "Equations of state calculations by fast computing machines," Jrn Chemical Physics, 21, 1087-1091.

J.R. Neil and K.B. Korb (1999) "The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors," in N. Zhong and L. Zhous (eds.) Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag.
GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.

J.R. Neil, C.S. Wallace and K.B. Korb (1999) "Learning Bayesian networks with restricted causal interactions," in Laskey and Prade (eds.) UAI 99, 486-493.

J. Rissanen (1978) "Modeling by shortest data description," Automatica, 14, 465-471.

H. Simon (1954) "Spurious Correlation: A Causal Interpretation," Jrn Amer Stat Assoc, 49, 467-479.

D. Spiegelhalter & S. Lauritzen (1990) "Sequential Updating of Conditional Probabilities on Directed Graphical Structures," Networks, 20, 579-605.

P. Spirtes, C. Glymour and R. Scheines (1990) "Causality from Probability," in J.E. Tiles, G.T. McKee and G.C. Dean (eds.) Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman.
AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.

P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag.
A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.

J. Suzuki (1996) "Learning Bayesian Belief Networks Based on the Minimum Description Length Principle," in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.

T.S. Verma and J. Pearl (1991) "Equivalence and Synthesis of Causal Models," in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier.
THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.

C.S. Wallace and D. Boulton (1968) "An information measure for classification," Computer Jrn, 11, 185-194.

C.S. Wallace and P.R. Freeman (1987) "Estimation and inference by compact coding," Jrn Royal Stat Soc (Series B), 49, 240-252.

C.S. Wallace and K.B. Korb (1999) "Learning Linear Causal Models by MML Sampling," in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag.
SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.

C.S. Wallace, K.B. Korb, and H. Dai (1996) "Causal Discovery via MML," in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 516-524). San Francisco: Morgan Kaufmann.
INTRODUCES AN MML METRIC FOR CAUSAL MODELS.

S. Wright (1921) "Correlation and Causation," Jrn Agricultural Research, 20, 557-585.

S. Wright (1934) "The Method of Path Coefficients," Annals of Mathematical Statistics, 5, 161-215.

Current Research

Bayesian Network URL's