
Probability Review

Thursday Sep 13

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

Sample Space and Events

Sample space Ω: the set of possible results of an experiment
If you toss a coin twice, Ω = {HH, HT, TH, TT}
Event: a subset of Ω
First toss is head = {HH, HT}
S: event space, a set of events
Closed under finite union and complements
Entails closure under other set operations: intersection, difference, etc.

Probability Measure
Defined over (Ω, S) s.t.
P(α) >= 0 for all α in S
P(Ω) = 1
If α and β are disjoint, then P(α ∪ β) = P(α) + P(β)
We can deduce other axioms from the above ones

Ex: P(α ∪ β) for non-disjoint events

P(α ∪ β) = P(α) + P(β) − P(α ∩ β)
Visualization

We can go on and define conditional probability using the above visualization.

Conditional Probability
P(F|H) = fraction of worlds in which H is true that also have F true

$P(F \mid H) = \frac{P(F \cap H)}{P(H)}$

Rule of Total Probability

[Figure: the sample space partitioned into disjoint events B1, ..., B7, with event A overlapping several of them.]

$P(A) = \sum_i P(B_i)\,P(A \mid B_i)$
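
To make the formula concrete, here is a minimal Python sketch (not from the slides) that applies the rule of total probability to a hypothetical three-way partition; all numbers are made up for illustration.

```python
# Rule of total probability: P(A) = sum_i P(B_i) * P(A | B_i)
# Hypothetical numbers for a partition B1, B2, B3 of the sample space.

p_B = [0.2, 0.5, 0.3]          # P(B_i); must sum to 1
p_A_given_B = [0.9, 0.4, 0.1]  # P(A | B_i)

p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))
print(p_A)  # 0.2*0.9 + 0.5*0.4 + 0.3*0.1 = 0.41
```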

From Events to Random Variables

Almost all the semester we will be dealing with RVs
Concise way of specifying attributes of outcomes
Modeling students (Grade and Intelligence):
Ω = all possible students
What are the events?
Grade_A = all students with grade A
Grade_B = all students with grade B
Intelligence_High = all students with high intelligence
Very cumbersome
We need functions that map from Ω to attribute values
Random Variables

[Figure: each student in Ω is mapped by the RV I (Intelligence) to {high, low} and by the RV G (Grade) to {A, A+, ...}.]

P(I = high) = P({all students whose intelligence is high})

Discrete Random Variables


Random variables (RVs) which may
take on only a countable number of
distinct values
E.g. the total number of tails X you get if
you flip 100 coins

X is a RV with arity k if it can take on exactly one value out of {x1, ..., xk}
E.g. the possible values that X can take on are 0, 1, 2, ..., 100

Probability of Discrete RV
Probability mass function (pmf): P(X = xi)
Easy facts about pmf
$\sum_i P(X = x_i) = 1$
$P(X = x_i \cap X = x_j) = 0$ if i ≠ j
$P(X = x_i \cup X = x_j) = P(X = x_i) + P(X = x_j)$ if i ≠ j
$P(X = x_1 \cup X = x_2 \cup \dots \cup X = x_k) = 1$

Common Distributions
Uniform: X ~ U[1, ..., N]
X takes values 1, 2, ..., N
P(X = i) = 1/N
E.g. picking balls of different colors from a box

Binomial: X ~ Bin(n, p)
X takes values 0, 1, ..., n

$P(X = i) = \binom{n}{i} p^i (1-p)^{n-i}$

E.g. the number of heads in n coin flips
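
As a quick sanity check (not part of the original slides), here is a minimal Python sketch of the binomial pmf using the standard-library `math.comb`:

```python
from math import comb

def binomial_pmf(i: int, n: int, p: float) -> float:
    """P(X = i) for X ~ Bin(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

# Probability of exactly 3 heads in 10 flips of a fair coin.
print(binomial_pmf(3, 10, 0.5))                            # ~0.117
# The pmf sums to 1 over i = 0, ..., n.
print(sum(binomial_pmf(i, 10, 0.5) for i in range(11)))    # 1.0
```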

Continuous Random
Variables
Probability density function (pdf)
instead of probability mass function
(pmf)
A pdf is any function f(x) that
describes the probability density in
terms of the input variable x.

Probability of Continuous RV
Properties of pdf

$f(x) \geq 0, \ \forall x$

$\int_{-\infty}^{\infty} f(x)\,dx = 1$

Actual probability can be obtained by taking the integral of the pdf
E.g. the probability of X being between 0 and 1 is

$P(0 \leq X \leq 1) = \int_0^1 f(x)\,dx$

Cumulative Distribution Function

$F_X(v) = P(X \leq v)$

Discrete RVs:
$F_X(v) = \sum_{v_i \leq v} P(X = v_i)$

Continuous RVs:
$F_X(v) = \int_{-\infty}^{v} f(x)\,dx$

$\frac{d}{dx} F_X(x) = f(x)$
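
For the discrete case, the CDF is just a running sum of the pmf; a minimal NumPy sketch (hypothetical numbers, not from the slides):

```python
import numpy as np

# Discrete RV taking values v_i with pmf p_i (made-up numbers).
values = np.array([0, 1, 2, 3])
pmf = np.array([0.1, 0.4, 0.3, 0.2])

cdf = np.cumsum(pmf)            # F_X(v_i) = sum of pmf up to and including v_i
print(dict(zip(values, cdf)))   # {0: 0.1, 1: 0.5, 2: 0.8, 3: 1.0}

# F_X(v) for an arbitrary v: sum the pmf over values <= v.
v = 1.5
print(pmf[values <= v].sum())   # 0.5
```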

Common Distributions
Normal: $X \sim N(\mu, \sigma^2)$

$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

E.g. the height of the entire population

Multivariate Normal
Generalization to higher dimensions of the one-dimensional normal
Mean vector $\mu$, covariance matrix $\Sigma$

$f_X(x_1, \dots, x_d) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
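
A minimal NumPy sketch (not from the slides) that evaluates this density for a hypothetical 2-D mean and covariance:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of a d-dimensional normal N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

# Hypothetical 2-D example.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.5, -1.0]), mu, Sigma))
```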

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

Joint Probability Distribution

Random variables encode attributes
Not all possible combinations of attributes are equally likely
Joint probability distributions quantify this
P(X = x, Y = y) = P(x, y)
Generalizes to N RVs

$\sum_x \sum_y P(X = x, Y = y) = 1$

$\int_x \int_y f_{X,Y}(x, y)\,dx\,dy = 1$

Chain Rule
Always true
P(x, y, z) = p(x) p(y|x) p(z|x, y)
= p(z) p(y|z) p(x|y, z)
= ... (any other ordering of the variables)

Conditional Probability
For events:

$P(X = x \mid Y = y) = \frac{P(X = x \cap Y = y)}{P(Y = y)}$

But we will always write it this way:

$P(x \mid y) = \frac{p(x, y)}{p(y)}$

Marginalization
We know p(X, Y); what is P(X = x)?
We can use the law of total probability. Why?

$p(x) = \sum_y P(x, y) = \sum_y P(y)\,P(x \mid y)$

[Figure: the partition B1, ..., B7 of the sample space with event A, as in the rule of total probability.]

Marginalization Cont.
Another example

$p(x) = \sum_{y,z} P(x, y, z) = \sum_{y,z} P(y, z)\,P(x \mid y, z)$
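
A minimal NumPy sketch (hypothetical joint table, not from the slides) showing that summing out a variable and applying the law of total probability give the same marginal:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) as a table: rows = x, columns = y.
P_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.25, 0.10]])
assert np.isclose(P_xy.sum(), 1.0)

# Marginalize out Y: p(x) = sum_y P(x, y)
p_x = P_xy.sum(axis=1)
print(p_x)                      # [0.35, 0.65]

# Equivalently via the law of total probability: p(x) = sum_y P(y) P(x | y)
p_y = P_xy.sum(axis=0)
P_x_given_y = P_xy / p_y        # each column is the conditional P(x | y)
print(P_x_given_y @ p_y)        # same as p_x
```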

Bayes Rule
We know that P(rain) = 0.5
If we also know that the grass is wet, how does this affect our belief about whether it rained or not?

$P(\mathrm{rain} \mid \mathrm{wet}) = \frac{P(\mathrm{rain})\,P(\mathrm{wet} \mid \mathrm{rain})}{P(\mathrm{wet})}$

In general:

$P(x \mid y) = \frac{P(x)\,P(y \mid x)}{P(y)}$
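
A minimal numeric sketch of the rain/wet-grass example (only P(rain) = 0.5 comes from the slide; the likelihoods below are hypothetical):

```python
# Bayes rule with made-up likelihoods for illustration.
p_rain = 0.5
p_wet_given_rain = 0.9
p_wet_given_no_rain = 0.2

# P(wet) by the rule of total probability.
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# P(rain | wet) = P(rain) P(wet | rain) / P(wet)
p_rain_given_wet = p_rain * p_wet_given_rain / p_wet
print(p_rain_given_wet)   # ~0.818: seeing wet grass raises our belief that it rained
```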

Bayes Rule cont.

You can condition on more variables

$P(x \mid y, z) = \frac{P(x \mid z)\,P(y \mid x, z)}{P(y \mid z)}$

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

Independence
X is independent of Y means that knowing Y does not change our belief about X.
P(X | Y = y) = P(X)
P(X = x, Y = y) = P(X = x) P(Y = y)
The above should hold for all x, y
It is symmetric and written as X ⊥ Y

Independence
X1, ..., Xn are independent if and only if

$P(X_1 \in A_1, \dots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i)$

If X1, ..., Xn are independent and identically distributed we say they are iid (or that they are a random sample) and we write

$X_1, \dots, X_n \sim P$

CI: Conditional Independence

RVs are rarely independent, but we can still leverage local structural properties like conditional independence.
X ⊥ Y | Z if once Z is observed, knowing the value of Y does not change our belief about X
E.g. rain ⊥ sprinklers on | cloudy
but rain and sprinklers on are not independent given wet grass

Conditional Independence

P(X = x | Z = z, Y = y) = P(X = x | Z = z)
P(Y = y | Z = z, X = x) = P(Y = y | Z = z)
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
We call these factors: a very useful concept
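
To make the factorization concrete, here is a minimal NumPy sketch (not from the slides) that builds a hypothetical joint P(X, Y, Z) satisfying X ⊥ Y | Z and checks that P(x, y | z) = P(x | z) P(y | z):

```python
import numpy as np

# Hypothetical conditionals; the joint is constructed as P(z) P(x|z) P(y|z).
P_z = np.array([0.4, 0.6])                 # P(Z)
P_x_given_z = np.array([[0.7, 0.2],        # rows: x, cols: z
                        [0.3, 0.8]])
P_y_given_z = np.array([[0.5, 0.1],        # rows: y, cols: z
                        [0.5, 0.9]])

# joint[x, y, z] = P(z) P(x|z) P(y|z)
joint = np.einsum('xz,yz,z->xyz', P_x_given_z, P_y_given_z, P_z)
assert np.isclose(joint.sum(), 1.0)

# Check P(x, y | z) == P(x | z) P(y | z) for every z.
P_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
factorized = np.einsum('xz,yz->xyz', P_x_given_z, P_y_given_z)
print(np.allclose(P_xy_given_z, factorized))   # True
```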

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

Mean and Variance

Mean (Expectation): $\mu = E[X]$
Discrete RVs:
$E[X] = \sum_{v_i} v_i\,P(X = v_i)$
$E[g(X)] = \sum_{v_i} g(v_i)\,P(X = v_i)$
Continuous RVs:
$E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$
$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$

Mean and Variance

Variance:
$Var(X) = E[(X - \mu)^2] = E[X^2] - \mu^2$
Discrete RVs:
$Var(X) = \sum_{v_i} (v_i - \mu)^2\,P(X = v_i)$
Continuous RVs:
$Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2\,f(x)\,dx$

Covariance:
$Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E[XY] - \mu_x \mu_y$
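
A minimal NumPy sketch (hypothetical simulated data, not from the slides) estimating mean, variance, and covariance from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated samples: y depends linearly on x plus noise (made-up model).
x = rng.normal(loc=2.0, scale=1.5, size=100_000)
y = 3.0 * x + rng.normal(scale=1.0, size=100_000)

print(x.mean(), x.var())                       # sample estimates of E[X] and Var(X)
print(np.cov(x, y)[0, 1])                      # sample Cov(X, Y), roughly 3 * Var(X) ≈ 6.75
print((x * y).mean() - x.mean() * y.mean())    # same thing via E[XY] - E[X]E[Y]
```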

Mean and Variance

Correlation:
$\rho(X, Y) = Cov(X, Y) / (\sigma_x \sigma_y)$
$-1 \leq \rho(X, Y) \leq 1$

Properties
Mean
$E[X + Y] = E[X] + E[Y]$
$E[aX] = a\,E[X]$
If X and Y are independent, $E[XY] = E[X]\,E[Y]$

Variance
$Var(aX + b) = a^2\,Var(X)$
If X and Y are independent, $Var(X + Y) = Var(X) + Var(Y)$

Some more properties

The conditional expectation of Y given X when the value of X = x is:
$E[Y \mid X = x] = \int y\,p(y \mid x)\,dy$

The Law of Total Expectation or Law of Iterated Expectation:
$E(Y) = E[E(Y \mid X)] = \int E(Y \mid X = x)\,p_X(x)\,dx$

Some more properties

The Law of Total Variance:
$Var(Y) = Var[E(Y \mid X)] + E[Var(Y \mid X)]$
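
Both laws are easy to check by simulation; a minimal sketch (hypothetical two-stage model, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-stage model: X ~ Uniform{0, 1, 2}, then Y | X ~ N(2*X, 1).
n = 500_000
x = rng.integers(0, 3, size=n)
y = rng.normal(loc=2.0 * x, scale=1.0)

# Law of Total Expectation: E[Y] = E[E[Y | X]] = E[2X] = 2.0
print(y.mean())

# Law of Total Variance: Var(Y) = Var(E[Y|X]) + E[Var(Y|X)]
#                               = Var(2X)     + 1
# Var(X) for Uniform{0,1,2} is 2/3, so Var(Y) ≈ 4 * 2/3 + 1 ≈ 3.67
print(y.var())
```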

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

The Big Picture

[Diagram: Model --(Probability)--> Data, and Data --(Estimation/learning)--> Model]

Statistical Inference
Given observations from a model
What (conditional) independence assumptions hold?
Structure learning

If you know the family of the model (e.g., multinomial), what are the values of the parameters: MLE, Bayesian estimation.
Parameter learning

Probability Review
Events and Event spaces
Random variables
Joint probability distributions
Marginalization, conditioning, chain
rule, Bayes Rule, law of total
probability, etc.
Structural properties
Independence, conditional
independence
Mean and Variance
The big picture

Monty Hall Problem


You're given the choice of three doors:
Behind one door is a car; behind the
others, goats.
You pick a door, say No. 1
The host, who knows what's behind the
doors, opens another door, say No. 3,
which has a goat.
Do you want to pick door No. 2 instead?

[Diagram: if your door hides the car, the host reveals Goat A or Goat B; if your door hides Goat A, the host must reveal Goat B; if your door hides Goat B, the host must reveal Goat A.]

Monty Hall Problem: Bayes Rule

$C_i$: the car is behind door i, i = 1, 2, 3; $P(C_i) = 1/3$
$H_{ij}$: the host opens door j after you pick door i

$P(H_{ij} \mid C_k) = \begin{cases} 0 & j = i \\ 0 & j = k \\ 1/2 & i = k \\ 1 & i \neq k,\ j \neq k \end{cases}$

Monty Hall Problem: Bayes Rule cont.
WLOG, i = 1, j = 3

$P(C_1 \mid H_{13}) = \frac{P(H_{13} \mid C_1)\,P(C_1)}{P(H_{13})} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{P(H_{13})} = \frac{1/6}{P(H_{13})}$

Monty Hall Problem: Bayes Rule cont.

$P(H_{13}) = P(H_{13}, C_1) + P(H_{13}, C_2) + P(H_{13}, C_3)$
$= P(H_{13} \mid C_1)\,P(C_1) + P(H_{13} \mid C_2)\,P(C_2) + P(H_{13} \mid C_3)\,P(C_3)$
$= \frac{1}{6} + \frac{1}{3} + 0 = \frac{1}{2}$

Monty Hall Problem: Bayes Rule cont.

$P(C_1 \mid H_{13}) = \frac{1/6}{1/2} = \frac{1}{3}$

$P(C_2 \mid H_{13}) = 1 - P(C_1 \mid H_{13}) - P(C_3 \mid H_{13}) = 1 - \frac{1}{3} - 0 = \frac{2}{3}$

You should switch!
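
The same answer falls out of a simple simulation; a minimal Python sketch (not from the slides):

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """One round of the game; returns True if the player wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a door that is neither the player's pick nor the car.
    host = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != host)
    return pick == car

n = 100_000
print("stay:  ", sum(monty_hall_trial(False) for _ in range(n)) / n)  # ~1/3
print("switch:", sum(monty_hall_trial(True) for _ in range(n)) / n)   # ~2/3
```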

Information Theory
P(X) encodes our uncertainty about X
Some variables are more uncertain than others

[Figure: two example distributions P(X) and P(Y) with different amounts of uncertainty.]

How can we quantify this intuition?

Entropy: average number of bits required to encode X

$H_P(X) = E_P\left[\log \frac{1}{P(x)}\right] = \sum_x P(x) \log \frac{1}{P(x)} = -\sum_x P(x) \log P(x)$

Information Theory cont.

Entropy: average number of bits required to encode X

$H_P(X) = E_P\left[\log \frac{1}{P(x)}\right] = \sum_x P(x) \log \frac{1}{P(x)} = -\sum_x P(x) \log P(x)$

We can define conditional entropy similarly:

$H_P(X \mid Y) = E\left[\log \frac{1}{p(x \mid y)}\right] = H_P(X, Y) - H_P(Y)$

i.e. once Y is known, we only need $H(X, Y) - H(Y)$ bits

We can also define a chain rule for entropies (not surprising):

$H_P(X, Y, Z) = H_P(X) + H_P(Y \mid X) + H_P(Z \mid X, Y)$

Mutual Information: MI
Remember independence?
If X ⊥ Y then knowing Y won't change our belief about X
Mutual information can help quantify this! (not the only way though)
MI:

$I_P(X; Y) = H_P(X) - H_P(X \mid Y)$

The amount of uncertainty in X which is removed by knowing Y
Symmetric
I(X;Y) = 0 iff X and Y are independent!

$I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$
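
A minimal NumPy sketch (hypothetical joint table, not from the slides) computing entropy and mutual information from a joint distribution:

```python
import numpy as np

def entropy(p):
    """H(P) in bits; ignores zero-probability entries."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical joint distribution P(X, Y); rows = x, columns = y.
P_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = P_xy.sum(axis=1)
p_y = P_xy.sum(axis=0)

H_x  = entropy(p_x)
H_y  = entropy(p_y)
H_xy = entropy(P_xy.ravel())

# I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
mi = H_x + H_y - H_xy
print(mi)   # > 0 here; it would be 0 if P_xy factored as p_x * p_y
```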

Chi Square Test for Independence (Example)

            Republican   Democrat   Independent   Total
Male        200          150        50            400
Female      250          300        50            600
Total       450          450        100           1000

State the hypotheses
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
Choose significance level
Say, 0.05

Chi Square Test for Independence
Analyze sample data (contingency table as above)

Degrees of freedom =
(|g| - 1) * (|v| - 1) = (2 - 1) * (3 - 1) = 2

Expected frequency counts:
Eg,v = (ng * nv) / n

Em,r = (400 * 450) / 1000 = 180000/1000 = 180
Em,d = (400 * 450) / 1000 = 180000/1000 = 180
Em,i = (400 * 100) / 1000 = 40000/1000 = 40
Ef,r = (600 * 450) / 1000 = 270000/1000 = 270
Ef,d = (600 * 450) / 1000 = 270000/1000 = 270
Ef,i = (600 * 100) / 1000 = 60000/1000 = 60

Chi Square Test for Independence
Chi-square test statistic (contingency table as above):

$\chi^2 = \sum_{g,v} \frac{(O_{g,v} - E_{g,v})^2}{E_{g,v}}$

χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40 +
     (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60
χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2

Chi Square Test for Independence
P-value
Probability of observing a sample statistic as extreme as the test statistic
P(χ² >= 16.2) = 0.0003

Since the P-value (0.0003) is less than the significance level (0.05), we reject the null hypothesis
There is a relationship between gender and voting preference
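
The same test can be reproduced with SciPy; a minimal sketch using `scipy.stats.chi2_contingency` on the observed counts from the example (assuming SciPy is available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the example: rows = gender, columns = party.
observed = np.array([[200, 150, 50],    # Male
                     [250, 300, 50]])   # Female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~16.2
print(dof)       # 2
print(p_value)   # ~0.0003
print(expected)  # the E_{g,v} table: [[180, 180, 40], [270, 270, 60]]
```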

Acknowledgment

Carlos Guestrin recitation slides:
http://www.cs.cmu.edu/~guestrin/Class/10708/recitations/r1/Probability_and_Statistics_Review.ppt
Andrew Moore Tutorial:
http://www.autonlab.org/tutorials/prob.html
Monty Hall problem:
http://en.wikipedia.org/wiki/Monty_Hall_problem
http://www.cs.cmu.edu/~guestrin/Class/10701F07/recitation_schedule.html
Chi-square test for independence:
http://stattrek.com/chi-square-test/independence.aspx
