Introduction to Statistical Machine Learning
© 2010 Christfried Webers
NICTA, The Australian National University
MLSS 2010

Outline
Overview
Linear Regression
Linear Classification
Neural Networks
Kernel Methods and SVM
Mixture Models and EM
Resources
More Machine Learning
Detailed outline:
Overview: Definition, Related Fields
Linear Regression: Bayesian Regression, Predictive Distribution
Linear Classification: Classification, Decision Theory, Discrete Features, Logistic Regression, Feature Space
Neural Networks: Parameter Optimisation
Kernel Methods and SVM: Kernel Methods
Mixture Models and EM: K-means Clustering, Convergence of EM
Resources
Sampling: Rejection Sampling, Importance Sampling
More Machine Learning
Part I: What is Machine Learning?
Definition
Examples of Machine Learning
Related Fields
Fundamental Types of Learning
Basic Probability Theory
Polynomial Curve Fitting
Backgammon
Image Denoising
[Figure: original image, image with noise added, denoised image.]
Source Separation
[Figure: audio sources, microphones, and the recorded audio mixtures; waveform plots.]
autonomous robotics,
detecting credit card fraud,
detecting network intrusion,
bioinformatics,
neuroscience,
medical diagnosis,
stock market analysis,
...
Related Fields
Artificial Intelligence - AI
Statistics
Game Theory
Neuroscience, Psychology
Data Mining
Computer Science
Adaptive Control Theory
...
Fundamental Types of Learning
Unsupervised Learning: association, clustering, density estimation, blind source separation
Supervised Learning: regression, classification
Reinforcement Learning: agents
Others: active learning, semi-supervised learning, transductive learning, ...
Unsupervised Learning
Example applications:
Clustering customers in Customer Relationship Management
Image compression: colour quantisation
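The colour-quantisation example above is usually done with k-means clustering. Below is a minimal sketch (not from the slides; the toy "image" data, K, and iteration count are made-up choices): each pixel is replaced by the nearest of K cluster centres.

```python
# Illustrative sketch: k-means colour quantisation on a toy "image".
# Data and K are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
# Toy "image": 200 RGB pixels drawn around two underlying colours.
pixels = np.vstack([rng.normal([0.9, 0.1, 0.1], 0.05, (100, 3)),
                    rng.normal([0.1, 0.1, 0.9], 0.05, (100, 3))])

K = 2
centres = pixels[rng.choice(len(pixels), K, replace=False)]
for _ in range(20):
    # Assignment step: each pixel goes to its nearest centre.
    dist = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Update step: each centre moves to the mean of its assigned pixels.
    centres = np.array([pixels[labels == k].mean(axis=0) if (labels == k).any()
                        else centres[k] for k in range(K)])

quantised = centres[labels]   # compressed image: only K distinct colours remain
```

Storing `labels` plus the K centres instead of the full pixel array is the compression.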
Supervised Learning
Given pairs of data and targets (=labels).
Learn a mapping from the data to the targets (training).
Goal: Use the learned mapping to correctly predict the
target for new input data.
Need to generalise well from the training data/target pairs.
Reinforcement Learning
Agent-environment loop: in round i, the agent receives observation_i and reward_i from the environment, then chooses action_i; the cycle repeats (observation_1, reward_1, action_1; observation_2, reward_2, action_2; ...).
Basic Probability Theory
Example: rolling a die. Sample space Ω = {1, 2, 3, 4, 5, 6}, events Even = {2, 4, 6} and Odd = {1, 3, 5}.
P(3) = 1/6, P(Odd) = P(Even) = 1/2
P(3 | Odd) = P(3 and Odd) / P(Odd) = (1/6) / (1/2) = 1/3

General axioms:
P({}) = 0 <= P(A) <= P(Ω) = 1
P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
P(A ∩ B) = P(A | B) P(B)

Rules of Probability:
Sum rule: P(X) = sum_Y P(X, Y)
Product rule: P(X, Y) = P(X | Y) P(Y)
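The die example can be checked numerically. A minimal sketch (using exact fractions; X is the parity, Y the face value, both choices made for illustration):

```python
# Verify the sum rule, product rule, and the conditional P(3 | Odd) = 1/3
# on the die example, using exact rational arithmetic.
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
P = {y: Fraction(1, 6) for y in faces}                          # P(Y = y)
joint = {('odd' if y % 2 else 'even', y): P[y] for y in faces}  # P(X, Y)

# Sum rule: P(X = odd) = sum_Y P(odd, Y)
P_odd = sum(p for (x, _), p in joint.items() if x == 'odd')

# Product rule: P(odd, 3) = P(odd | Y = 3) * P(Y = 3), and P(odd | Y = 3) = 1
product_ok = joint[('odd', 3)] == Fraction(1, 1) * P[3]

# Conditional from the slide: P(3 | Odd) = P(3 and Odd) / P(Odd)
P_3_given_odd = joint[('odd', 3)] / P_odd
```
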
Probability Jargon
Example: a binary sequence 1101 with Bernoulli parameter θ and a flat prior.
Likelihood: p(1101 | θ) = θ^3 (1 − θ)
Posterior (Bayes Rule): p(θ | 1101) = p(1101 | θ) p(θ) / p(1101)
Maximum a Posteriori (MAP) estimate: arg max_θ p(θ | 1101) = 3/4
Predictive probability of the next symbol being 1: p(11011) / p(1101) = 2/3
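These numbers can be reproduced on a grid. A minimal sketch, assuming the flat prior stated above (the grid size is an arbitrary choice): the MAP estimate should come out near 3/4 and the predictive probability near 2/3.

```python
# Grid approximation of the posterior p(theta | 1101) under a flat prior.
import numpy as np

theta = np.linspace(0, 1, 100001)
likelihood = theta**3 * (1 - theta)   # p(1101 | theta): three 1s, one 0
prior = np.ones_like(theta)           # flat prior
w = likelihood * prior
w /= w.sum()                          # normalised posterior weights on the grid

theta_map = theta[np.argmax(w)]       # MAP estimate
p_next_1 = (theta * w).sum()          # predictive E[theta | 1101]
```
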
Polynomial Curve Fitting
Training data: N = 10 pairs of inputs and targets,
x ≡ (x1, ..., xN), xi ∈ R with xi in [0, 1], i = 1, ..., N
t ≡ (t1, ..., tN), ti ∈ R, i = 1, ..., N
M: order of the polynomial

y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = sum_{m=0}^{M} wm x^m

This is a nonlinear function of x, but a linear function of the unknown model parameters w.
How can we find good parameters w = (w0, ..., wM)^T ?
Error function: the sum of squared differences between predictions and targets,

E(w) = 1/2 sum_{n=1}^{N} ( y(xn, w) − tn )^2

[Figure: target tn versus prediction y(xn, w) at input xn.]
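Minimising E(w) for a polynomial model is a linear least-squares problem. A minimal sketch (the sinusoidal toy data and M = 3 are illustrative choices, echoing the running example):

```python
# Fit a degree-M polynomial by minimising E(w) = 1/2 sum_n (y(x_n, w) - t_n)^2.
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)  # noisy sinusoidal targets

X = np.vander(x, M + 1, increasing=True)      # columns 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(X, t, rcond=None)     # least-squares solution

def y(x_new, w):
    return np.vander(np.atleast_1d(x_new), len(w), increasing=True) @ w

E = 0.5 * np.sum((y(x, w) - t) ** 2)          # sum-of-squares error at the optimum
```
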
Polynomial fits y(x, w) = sum_{m=0}^{M} wm x^m of increasing order:

M = 0: y(x, w) = w0
M = 1: y(x, w) = w0 + w1 x
M = 3: y(x, w) = w0 + w1 x + w2 x^2 + w3 x^3
M = 9: y(x, w) = w0 + w1 x + ... + w8 x^8 + w9 x^9 (overfitting)

[Figures: the fitted curves for M = 0, 1, 3, 9.]
Root-mean-square error: E_RMS = sqrt( 2 E(w*) / N )

[Figure: E_RMS on the training and test sets as a function of M.]
Coefficients w* for polynomials of various order:

       M = 0    M = 1    M = 3      M = 9
w0*    0.19     0.82     0.31       0.35
w1*             -1.27    7.99       232.37
w2*                      -25.43     -5321.83
w3*                      17.37      48568.31
w4*                                 -231639.30
w5*                                 640042.26
w6*                                 -1061800.52
w7*                                 1042400.18
w8*                                 -557682.99
w9*                                 125201.43
Increasing the size of the data set reduces over-fitting (M = 9):

[Figure: fit with N = 15 data points.]
[Figure: fit with N = 100 data points.]
Regularisation
Penalise large coefficients by adding a regularisation term to the error function:

E~(w) = 1/2 sum_{n=1}^{N} ( y(xn, w) − tn )^2 + λ/2 ||w||^2
[Figure: regularised fit with M = 9 and ln λ = −18.]
[Figure: regularised fit with M = 9 and ln λ = 0.]
[Figure: E_RMS on the training and test sets as a function of ln λ, for M = 9.]
Part II: Linear Regression
Linear Basis Function Models
The simplest linear model is linear in the input variables:

y(x, w) = sum_{j=0}^{D} wj x_j = w^T x   (with x_0 = 1 absorbing the bias)

[Figure: least-squares fit of a plane to data points (X1, X2, Y).]
Loss(w) = sum_{n=1}^{N} ( y(xn, w) − tn )^2

[Figure from Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 3: least-squares fit in the (X1, X2) plane.]
Linear Basis Function Models
Linear combination of fixed nonlinear basis functions:

y(x, w) = sum_{j=0}^{M−1} wj φj(x) = w^T φ(x)
[Figure: polynomial basis functions φj(x) = x^j.]
Gaussian basis functions:

φj(x) = exp( −(x − μj)^2 / (2 s^2) )

[Figure: Gaussian basis functions.]
Sigmoidal basis functions:

φj(x) = σ( (x − μj) / s ),  where σ(a) = 1 / (1 + exp(−a))

[Figure: sigmoidal basis functions.]
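The three basis-function families above are easy to tabulate. A minimal sketch (the grid, the centres μj, and the width s are illustrative choices, not values from the slides):

```python
# Evaluate polynomial, Gaussian, and sigmoidal basis functions on a grid.
import numpy as np

x = np.linspace(-1, 1, 201)
mus, s = np.linspace(-1, 1, 5), 0.3   # centres mu_j and width s (made up)

poly = np.stack([x**j for j in range(5)])                          # phi_j(x) = x^j
gauss = np.stack([np.exp(-(x - m)**2 / (2 * s**2)) for m in mus])  # Gaussian bumps
sigmoid = np.stack([1 / (1 + np.exp(-(x - m) / s)) for m in mus])  # shifted sigmoids
```

Each row of these arrays is one basis function; stacking them column-wise over data points gives the design matrix used below.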
Wavelets: localised in both space and frequency, and mutually orthogonal to simplify application.

[Figure from Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 5: Haar wavelets and Symmlet-8 wavelets over time in [0, 1].]
Maximum Likelihood and Least Squares
Assume the target is given by a deterministic function plus noise:

t = y(x, w) + ε
With Gaussian noise of precision β, the likelihood of the targets is

p(t | X, w, β) = prod_{n=1}^{N} N( tn | w^T φ(xn), β^{−1} )

and the log likelihood becomes

ln p(t | X, w, β) = sum_{n=1}^{N} ln N( tn | w^T φ(xn), β^{−1} )
                  = N/2 ln β − N/2 ln(2π) − β E_D(w)
where E_D(w) is the sum-of-squares error

E_D(w) = 1/2 sum_{n=1}^{N} ( tn − w^T φ(xn) )^2 = 1/2 (t − Φw)^T (t − Φw)

with the design matrix

Φ = [ φ0(x1)  φ1(x1)  ...  φ_{M−1}(x1)
      φ0(x2)  φ1(x2)  ...  φ_{M−1}(x2)
      ...
      φ0(xN)  φ1(xN)  ...  φ_{M−1}(xN) ]

Maximising the likelihood gives the least-squares solution w_ML = (Φ^T Φ)^{−1} Φ^T t.
Regularized Least Squares
Add a quadratic regulariser E_W(w) = 1/2 w^T w to the sum-of-squares error; the minimiser is

w = ( λI + Φ^T Φ )^{−1} Φ^T t
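The closed form above is straightforward to implement. A minimal sketch (toy sinusoidal data and a polynomial design matrix are made-up choices): as λ goes to 0 it should recover the unregularised least-squares solution, and larger λ shrinks the weights.

```python
# Regularised least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
import numpy as np

rng = np.random.default_rng(2)
N, M = 20, 6
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, N)
Phi = np.vander(x, M, increasing=True)        # polynomial design matrix

def ridge(lam):
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)      # solve instead of explicit inverse

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # unregularised solution
```
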
A more general regulariser:

λ/2 sum_{j=1}^{M} |wj|^q

[Figure: contours of the regulariser for q = 0.5, 1, 2, 4.]
Quadratic regulariser 1/2 sum_{j=1}^{M} wj^2 versus the lasso regulariser sum_{j=1}^{M} |wj|.

[Figure: error contours and regulariser constraint regions in the (w1, w2) plane; the lasso tends to produce sparse solutions.]
Bayesian Regression
Bayes Theorem:

posterior = likelihood × prior / normalisation

p(w | t) = p(t | w) p(w) / p(t)

Likelihood:

p(t | w) = prod_{n=1}^{N} N( tn | w^T φ(xn), β^{−1} )
         = const × exp{ −β/2 (t − Φw)^T (t − Φw) }
Conjugate priors:

Likelihood            Conjugate Prior
Bernoulli             Beta
Binomial              Beta
Poisson               Gamma
Multinomial           Dirichlet
Uniform               Pareto
Exponential           Gamma
Normal                Normal
Multivariate normal   Multivariate normal
Sequential Bayesian Regression
With no data point (N = 0), start with the prior p(w).
Each posterior acts as the prior for the next data/target pair:

p(w) --[Bayes with p(t1 | w, x1)]--> p(w | t1, x1) --[Bayes with p(t2 | w, x2)]--> ...

This nicely fits a sequential learning framework.
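The sequential framework can be checked directly for Bayesian linear regression with a Gaussian prior and known noise precision. A minimal sketch (α, β, the straight-line model, and the toy data are illustrative assumptions): updating one point at a time gives exactly the same posterior mean as a single batch update.

```python
# Sequential vs batch posterior for Bayesian linear regression,
# prior p(w) = N(w | 0, alpha^{-1} I), noise precision beta (values made up).
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)        # noisy line
Phi = np.column_stack([np.ones_like(x), x])         # phi(x) = (1, x)

# Batch posterior: S_N^{-1} = alpha*I + beta*Phi^T Phi, m_N = beta * S_N Phi^T t
S_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
m_batch = beta * np.linalg.solve(S_inv, Phi.T @ t)

# Sequential: each posterior becomes the prior for the next (x_n, t_n).
S_inv_seq = alpha * np.eye(2)
b = np.zeros(2)
for phi_n, t_n in zip(Phi, t):
    S_inv_seq += beta * np.outer(phi_n, phi_n)
    b += beta * t_n * phi_n
m_seq = np.linalg.solve(S_inv_seq, b)
```
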
Predictive Distribution

p(t | x, X, t) = ∫ p(t, w | x, X, t) dw                    (sum rule)
              = ∫ p(t | w, x, X, t) p(w | x, X, t) dw      (product rule)
              = ∫ p(t | w, x) p(w | X, t) dw

where the first factor involves testing only and the second training only.
With prior p(w | α) = N(w | 0, α^{−1} I), the predictive variance is

σ²_N(x) = β^{−1} + φ(x)^T ( αI + β Φ^T Φ )^{−1} φ(x)

where the first term is the noise of the data and the second the uncertainty of w. Moreover

σ²_{N+1}(x) <= σ²_N(x)   and   lim_{N→∞} σ²_N(x) = β^{−1}
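Both claimed properties of σ²_N(x) can be verified numerically. A minimal sketch (the straight-line basis, α, β, and sample sizes are illustrative assumptions): the variance shrinks as data are added and approaches the noise floor 1/β.

```python
# Predictive variance sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x),
# with S_N^{-1} = alpha*I + beta*Phi^T Phi (toy setup, values made up).
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 1.0, 25.0
phi = lambda x: np.column_stack([np.ones_like(x), x])   # phi(x) = (1, x)

def pred_var(x_star, X):
    P = phi(X)
    S_inv = alpha * np.eye(2) + beta * P.T @ P
    v = np.linalg.solve(S_inv, phi(x_star).T)
    return 1 / beta + (phi(x_star) @ v).item()

x_star = np.array([0.3])
X_small = rng.uniform(-1, 1, 10)
X_large = np.concatenate([X_small, rng.uniform(-1, 1, 10000)])

v_small = pred_var(x_star, X_small)
v_large = pred_var(x_star, X_large)
```
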
[Figures: example with artificial sinusoidal data from sin(2πx) (green) and added noise; samples y(x, w) (red) drawn from the posterior distribution p(w | X, t).]
Limitations of Linear Basis Function Models
Basis functions φj(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
Data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D. We need algorithms which place basis functions only where data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks).
Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (neural networks).
Curse of Dimensionality
Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: a sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 − ε and r = 1?
Volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and so

( V_D(1) − V_D(1 − ε) ) / V_D(1) = 1 − (1 − ε)^D
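A quick numerical check of the formula above (the values of ε and D are illustrative): even a thin shell contains most of the volume once D is large.

```python
# Fraction of a unit D-sphere's volume in the shell between 1 - eps and 1.
def shell_fraction(eps, D):
    return 1 - (1 - eps) ** D

f1 = shell_fraction(0.05, 1)      # in 1-D the thin shell holds little volume
f20 = shell_fraction(0.05, 20)    # in 20-D the same shell already dominates
f500 = shell_fraction(0.05, 500)  # in 500-D essentially all volume is in it
```
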
Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 − ε and r = 1:

( V_D(1) − V_D(1 − ε) ) / V_D(1) = 1 − (1 − ε)^D

[Figure: volume fraction versus ε for D = 1, 2, 5, 20.]
[Figure: probability density p(r) with respect to radius r of a Gaussian distribution, for D = 1, 2, 20.]
Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I:

N(x | 0, I) = 1/(2π) exp( −1/2 x^T x ) = 1/(2π) exp( −1/2 (x1² + x2²) )

Coordinate transformation x1 = r cos θ, x2 = r sin θ gives the density

1/(2π) r exp( −1/2 r² )
Probability density with respect to radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

p(r, θ | 0, I) = 1/(2π) r exp( −1/2 r² )

p(r | 0, I) = ∫ 1/(2π) r exp( −1/2 r² ) dθ = r exp( −1/2 r² )

[Figure: p(r) for D = 1, 2, 20; for large D the probability mass concentrates in a thin shell.]
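The concentration of the radial density can be observed by sampling. A minimal sketch (the sample size is an arbitrary choice): for a standard Gaussian in D dimensions the radius concentrates around sqrt(D), with a relative spread that shrinks as D grows.

```python
# Monte Carlo: radii of standard-Gaussian samples concentrate as D grows.
import numpy as np

rng = np.random.default_rng(5)
stats = {}
for D in (1, 2, 20):
    r = np.linalg.norm(rng.normal(size=(100000, D)), axis=1)
    stats[D] = (r.mean(), r.std())   # mean radius grows like sqrt(D)
```
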
Part III: Linear Classification
Classification
Generalised Linear Model
Inference and Decision
Decision Theory
Fisher's Linear Discriminant
The Perceptron Algorithm
Probabilistic Generative Models
Discrete Features
Logistic Regression
Feature Space
Classification
[Figure: two-class data in the plane.]
Linear Model
Idea: use again a linear model, as in regression: y(x, w) is a linear function of the parameters w,

y(xn, w) = w^T φ(xn)

But generally y(xn, w) ∈ R, while we want a discrete class label. Example: which class is y(x, w) = 0.71623?
Generalised Linear Model
Apply a nonlinear activation function f(·) to the linear model:

y(x, w) = f( w^T φ(x) )

The inverse f^{−1}(·) is called the link function.

[Figure: the sign function sign(z) as an example activation.]
Generative Models
Probability of a mistake:

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

[Figure: class densities p(x, C1) and p(x, C2) over x, decision boundary x0, decision regions R1 and R2.]
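The two integrals above can be evaluated numerically for a concrete pair of class densities. A minimal sketch (two 1-D Gaussian classes with equal priors, all values made up): the boundary x0 at the crossing of the joint densities gives a smaller p(mistake) than any shifted boundary.

```python
# Numerically evaluate p(mistake) for two 1-D Gaussian class densities
# with equal priors 1/2 (toy setup).
import numpy as np

mu1, mu2, s = -1.0, 1.0, 1.0
x = np.linspace(-8, 8, 16001)
dx = x[1] - x[0]
norm = lambda m: np.exp(-(x - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
p1, p2 = 0.5 * norm(mu1), 0.5 * norm(mu2)   # joint densities p(x, C_k)

def p_mistake(x0):
    # integral of p(x, C2) over R1 = (-inf, x0] plus p(x, C1) over R2 = (x0, inf)
    return (p2[x <= x0].sum() + p1[x > x0].sum()) * dx

err_opt = p_mistake(0.0)   # by symmetry the optimal boundary is x0 = 0
```
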
Expected loss for a loss matrix with entries L_{kj}:

E[L] = sum_k sum_j ∫_{Rj} L_{kj} p(x, Ck) dx
[Figure: posterior probabilities p(C_1 | x) and p(C_2 | x); inputs for which the larger posterior falls below a threshold form the reject region.]
Absorb the bias w_k0 into the weights: w̃_k = (w_k0, w_k^T)^T ∈ ℝ^{D+1} and x̃ = (1, x^T)^T ∈ ℝ^{D+1}.
Collecting all classes, W̃ = (w̃_1 ... w̃_K) ∈ ℝ^{(D+1)×K} and y(x) = W̃^T x̃ ∈ ℝ^K.
Determine W̃ by minimising the sum-of-squares error
E_D(W̃) = ½ tr{ (X̃W̃ - T)^T (X̃W̃ - T) }
Fisher's Linear Discriminant: project the input onto one dimension, y(x) = w^T x.
Class means:
m_1 = (1/N_1) Σ_{n ∈ C_1} x_n,   m_2 = (1/N_2) Σ_{n ∈ C_2} x_n
Within-class variance of the projected data:
s_k^2 = Σ_{n ∈ C_k} (y_n - m_k)^2,  where y_n = w^T x_n.
Maximise the Fisher criterion
J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2)
Equivalently,
J(w) = (w^T S_B w) / (w^T S_W w)
where S_B is the between-class covariance
S_B = (m_2 - m_1)(m_2 - m_1)^T
and S_W is the within-class covariance
S_W = Σ_{n ∈ C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n ∈ C_2} (x_n - m_2)(x_n - m_2)^T
Maximising
J(w) = (w^T S_B w) / (w^T S_W w)
with respect to w yields Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 - m_1)
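The solution above can be sketched in a few lines of NumPy (the two well-separated Gaussian classes below are invented for illustration):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant: w proportional to S_W^{-1} (m2 - m1),
    with S_W the (unnormalised) within-class scatter matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(100, 2))   # class C1
X2 = rng.normal([4, 4], 0.5, size=(100, 2))   # class C2
w = fisher_direction(X1, X2)
# Projected class means y = w^T x should be well separated
print((X2 @ w).mean() - (X1 @ w).mean())
```

Solving the linear system instead of inverting S_W explicitly is the usual numerically safer choice.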
The Perceptron Algorithm: two classes with a nonlinear activation function
f(a) = +1 for a ≥ 0, -1 for a < 0,
and target coding t_n = +1 if x_n ∈ C_1, t_n = -1 if x_n ∈ C_2.
Perceptron criterion, summing over the set M of misclassified patterns:
E_P(w) = -Σ_{n ∈ M} w^T φ(x_n) t_n
Gradient:
∇E_P(w) = -Σ_{n ∈ M} φ_n t_n
Stochastic gradient descent update for a misclassified pattern n:
w^(τ+1) = w^(τ) + η φ_n t_n
[Figure: two snapshots of perceptron learning; each update adds η φ_n t_n for a misclassified point and rotates the decision boundary.]
The contribution of the pattern used in the update is reduced, because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.
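The update rule can be sketched directly (a toy sketch; the four-point data set, learning rate, and epoch cap are invented for illustration):

```python
import numpy as np

def perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning: w <- w + eta * phi_n * t_n for each misclassified n.

    Phi: (N, D) feature matrix (first column constant 1 for the bias),
    t:   (N,) targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:      # misclassified (or on the boundary)
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                   # separable case: converged
            break
    return w

# Linearly separable toy problem
Phi = np.array([[1., 0., 0.], [1., 0., 2.], [1., 2., 0.], [1., 2., 2.]])
t = np.array([-1., 1., 1., 1.])
w = perceptron(Phi, t)
print(np.sign(Phi @ w))  # [-1.  1.  1.  1.], i.e. matches t
```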
p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) ) = σ(a),  with a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]
Logistic Sigmoid
The logistic sigmoid function σ(a) = 1 / (1 + exp(-a)).
Its inverse is the logit, a(σ) = ln( σ / (1 - σ) ).
[Figure: the logistic sigmoid σ(a) and the logit a(σ).]
For more than two classes,
p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j)
where
a_k = ln( p(x | C_k) p(C_k) ).
Also called softmax function, as it is a smoothed version of the max function.
Example: if a_k ≫ a_j for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
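Both functions are one-liners; the only subtlety worth showing is the standard max-subtraction trick that keeps the softmax from overflowing (a small sketch, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """softmax_k(a) = exp(a_k) / sum_j exp(a_j).
    Subtracting max(a) leaves the result unchanged but avoids overflow."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

print(sigmoid(0.0))                          # 0.5
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())                               # 1.0 (up to rounding)
print(softmax(np.array([100.0, 1.0, 0.0])))  # "smoothed max": first entry near 1
```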
Discrete Features
Binary features x_i ∈ {0, 1}, assumed conditionally independent given the class (naive Bayes):
p(x | C_k) = Π_{i=1}^D μ_ki^{x_i} (1 - μ_ki)^{1 - x_i}
With class-conditional densities
p(x | C_k) = Π_{i=1}^D μ_ki^{x_i} (1 - μ_ki)^{1 - x_i}
the posterior is again a softmax,
p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j)
with activations that are linear functions of the inputs,
a_k(x) = Σ_{i=1}^D { x_i ln μ_ki + (1 - x_i) ln(1 - μ_ki) } + ln p(C_k)
Gaussian class-conditional densities with a shared covariance matrix.
[Figure: class-conditional Gaussians with shared covariance and the resulting posterior.]
Logistic Regression
Likelihood for targets t_n ∈ {0, 1}:
p(t | w) = Π_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}
where y_n = p(C_1 | φ_n).
Error function: negative log likelihood resulting in the cross-entropy error function
E(w) = -ln p(t | w) = -Σ_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
E(w) = -Σ_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
y_n = p(C_1 | φ_n) = σ(w^T φ_n)
Gradient of the error function (using dσ/da = σ(1 - σ)):
∇E(w) = Σ_{n=1}^N (y_n - t_n) φ_n
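This gradient plugs straight into batch gradient descent (a minimal sketch; the 1D toy data, learning rate, and iteration count are invented):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, eta=0.1, iters=2000):
    """Gradient descent on the cross-entropy error:
    grad E(w) = sum_n (y_n - t_n) phi_n with y_n = sigmoid(w^T phi_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        w -= eta * Phi.T @ (y - t)
    return w

# Toy 1D problem with a bias feature: class 1 for x > 0
Phi = np.array([[1., -2.], [1., -1.], [1., 1.], [1., 2.]])
t = np.array([0., 0., 1., 1.])
w = fit_logistic(Phi, t)
y = sigmoid(Phi @ w)
print((y > 0.5).astype(float))  # [0. 0. 1. 1.]
```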
Feature Space
[Figure: a two-class problem in the original input space (x_1, x_2) and after a nonlinear feature transformation; only axis values survived extraction.]
Part IV
Neural Networks
Functional Transformations
As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and a hidden unit z_0 = 1:
y_k(x, w) = g( Σ_{j=0}^M w_kj^(2) h( Σ_{i=0}^D w_ji^(1) x_i ) )
Writing φ_j(x) for the output of hidden unit j, this is a generalised linear model with adaptive basis functions:
y_k(x, w) = g( Σ_{j=0}^M w_kj^(2) φ_j(x) )
[Figure: two-layer network with inputs x_0 ... x_D, hidden units z_0 ... z_M (first-layer weights w^(1), e.g. w_MD^(1)), and outputs y_1 ... y_K (second-layer weights w^(2), e.g. w_KM^(2), w_10^(2)).]
[Figure: network output over (x_1, x_2) ∈ [-10, 10]^2 for different weight vectors, e.g. w = (0, 1, 0.1) and w = (0, 0.1, 1).]
Heaviside step function:
H(x) = 1 for x ≥ 0, 0 for x < 0.
Parameter Optimisation
Minimise the sum-of-squares error
E(w) = ½ Σ_{n=1}^N ‖ y(x_n, w) - t_n ‖^2
by gradient descent.
Error Backpropagation
Propagate the errors δ_k backwards through the weights w_kj,
δ_j = h'(a_j) Σ_k w_kj δ_k,
and obtain the gradient from the unit activations z_i:
∂E_n(w) / ∂w_ji = δ_j z_i
[Figure: backpropagation of δ_k from the output units to hidden unit j via the weights w_kj.]
Alternative: numerical differentiation with finite differences,
∂E_n(w) / ∂w_ji = ( E_n(w_ji + ε) - E_n(w_ji) ) / ε + O(ε)
or the numerically more stable (fewer round-off errors) symmetric differences
∂E_n(w) / ∂w_ji = ( E_n(w_ji + ε) - E_n(w_ji - ε) ) / (2ε) + O(ε^2)
which both need O(W^2) operations.
[Figure: fits with M = 1, M = 3, and M = 10 hidden units.]
Part V
Kernel Methods
Maximum Margin Classifiers
Kernel Methods
Model f(x) as a kernel expansion over the N training points,
f(x) = Σ_{n=1}^N α_n k(x_n, x),
with minimising
Σ_{n=1}^N L(t_n, (Kα)_n) + λ α^T K α,
and kernel (Gram) matrix K_ij = K_ji = k(x_i, x_j).
Kernel trick based on Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional (possibly infinite-dimensional) space.
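For the squared loss, the objective above has a closed-form minimiser, which gives a compact sketch of the whole kernel-expansion idea (kernel ridge regression; the RBF kernel, toy sine data, and regularisation strength are choices made here for illustration, not taken from the slides):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) with k(x, y) = exp(-gamma ||x - y||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# With squared loss L, minimising sum_n L(t_n, (K alpha)_n) + lambda alpha^T K alpha
# gives the closed form alpha = (K + lambda I)^{-1} t.
X = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
t = np.sin(X).ravel()
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), t)

x_new = np.array([[np.pi / 2]])
f = rbf_kernel(x_new, X) @ alpha     # f(x) = sum_n alpha_n k(x_n, x)
print(f[0])                          # close to sin(pi/2) = 1
```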
Maximum Margin Classifiers
Two classes with targets t_n ∈ {-1, 1}.
[Figure: decision boundary y = 0 with the margin boundaries y = -1 and y = 1.]
The weight vector is a linear combination of the mapped training points,
w = Σ_{n=1}^N α_n φ(x_n),
so that
f(x) = w^T φ(x) = Σ_{n=1}^N α_n k(x_n, x)
[Figure: soft-margin classifier with slack variables ξ_n: points with ξ = 0 lie on or outside the margin, 0 < ξ < 1 inside the margin but correctly classified, ξ > 1 misclassified; the margin has width 2 / ‖w‖.]
Part VI
Mixture Models and EM
K-means Clustering
Mixture of Bernoulli Distributions
EM for Gaussian Mixtures - Latent Variables
Convergence of EM
K-means Clustering
Goal: partition N features x_n into K clusters using the Euclidean distance d(x_i, x_j) = ‖x_i - x_j‖, such that each feature belongs to the cluster with the nearest mean.
Distortion measure: J = Σ_{n=1}^N d(x_n, μ_cl(x_n))^2, where cl(x_n) is the index of the cluster centre closest to x_n.
Start with K arbitrary cluster centres μ_k.
E-step: minimise J w.r.t. the assignments cl(x_n): assign each data point x_n to its closest cluster centre.
M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.
Iterating over the E/M-steps converges to a local minimum of J.
[Figure: successive E- and M-steps of K-means on a two-dimensional data set, panels (a)-(i).]
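The two alternating steps fit in a dozen lines of NumPy (a toy sketch; the two-blob data set and the fixed iteration count are invented for illustration):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """K-means: alternate assignment (E-step) and mean update (M-step),
    minimising the distortion J = sum_n ||x_n - mu_{cl(x_n)}||^2."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # K arbitrary centres
    for _ in range(iters):
        # E-step: assign each point to the nearest centre
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        cl = d2.argmin(axis=1)
        # M-step: recompute each centre as the mean of its assigned points
        for k in range(K):
            if np.any(cl == k):
                mu[k] = X[cl == k].mean(axis=0)
    return mu, cl

# Two well-separated blobs around 0 and 5
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
mu, cl = kmeans(X, K=2)
print(np.sort(mu[:, 0]))  # centres near 0 and 5
```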
Mixture of Gaussians:
p(x | π, μ, Σ) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)
Maximise the likelihood p(X | π, μ, Σ) w.r.t. π, μ, Σ.
(Compare K-means: E-step, minimise J w.r.t. cl(x_n) by assigning each data point to the closest cluster; M-step, minimise J w.r.t. μ_k by taking the mean of the points in cluster k.)
[Figure: example mixture densities, and EM iterations L = 1, 2, 5, 20 on a two-dimensional data set.]
E-step: evaluate the responsibilities
γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)
M-step: with N_k = Σ_{n=1}^N γ(z_nk), re-estimate the parameters
μ_k^new = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n
Σ_k^new = (1/N_k) Σ_{n=1}^N γ(z_nk) (x_n - μ_k^new)(x_n - μ_k^new)^T
π_k^new = N_k / N
Evaluate the log likelihood
ln p(X | π, μ, Σ) = Σ_{n=1}^N ln Σ_{k=1}^K π_k^new N(x_n | μ_k^new, Σ_k^new)
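For one-dimensional data the two steps can be sketched compactly (a toy illustration, not from the slides; the two-component data, the crude min/max initialisation, and the iteration count are invented):

```python
import numpy as np

def em_gmm_1d(x, iters=100):
    """EM for a two-component 1D Gaussian mixture.
    E-step: responsibilities gamma(z_nk); M-step: update pi, mu, var."""
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])     # crude but deterministic init
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n|mu_k,var_k) / sum_j pi_j N(x_n|mu_j,var_j)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: N_k, then mu_k, var_k, pi_k
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])
pi, mu, var = em_gmm_1d(x)
print(np.sort(mu))   # component means near -4 and 4
```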
Mixture of Bernoulli Distributions
Multivariate Bernoulli distribution with mean parameters μ_i:
p(x | μ) = Π_{i=1}^D μ_i^{x_i} (1 - μ_i)^{1 - x_i}
E[x] = μ
cov[x] = diag{ μ_i (1 - μ_i) }
Mixture of Bernoulli distributions:
p(x | μ, π) = Σ_{k=1}^K π_k p(x | μ_k),  with p(x | μ_k) = Π_{i=1}^D μ_ki^{x_i} (1 - μ_ki)^{1 - x_i}
E-step: responsibilities
γ(z_nk) = π_k p(x_n | μ_k) / Σ_j π_j p(x_n | μ_j)
M-step: with N_k = Σ_{n=1}^N γ(z_nk) and x̄_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n,
μ_k = x̄_k,   π_k = N_k / N
Examples from a digits data set, each pixel taking only binary values.
EM for Gaussian Mixtures - Latent Variables
Log likelihood with the latent variables Z summed out:
ln p(X | θ) = ln Σ_Z p(X, Z | θ)
EM - Key Idea
Instead of maximising ln p(X | θ) directly, iterate between constructing a tractable lower bound from the current posterior over the latent variables Z and maximising that bound over θ.
EM Algorithm
1. Initialise the parameters θ^old.
2. E-step: evaluate the posterior p(Z | X, θ^old).
3. M-step: find θ^new = arg max_θ Q(θ, θ^old),
where
Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)
EM Algorithm - Convergence
Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:
ln p(X, Z | θ) = ln p(Z | X, θ) + ln p(X | θ).
Apply Σ_Z q(Z) · for an arbitrary distribution q(Z) and rewrite as
ln p(X | θ) = Σ_Z q(Z) ln( p(X, Z | θ) / q(Z) ) - Σ_Z q(Z) ln( p(Z | X, θ) / q(Z) ) = L(q, θ) + KL(q ‖ p)
Kullback-Leibler Divergence
KL(q ‖ p) = -∫ q(y) ln( p(y) / q(y) ) dy = ∫ q(y) ln( q(y) / p(y) ) dy
KL(q ‖ p) ≥ 0
not symmetric: KL(q ‖ p) ≠ KL(p ‖ q)
KL(q ‖ p) = 0 iff q = p.
invariant under parameter transformations
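The discrete form of the divergence and its basic properties can be checked in a few lines (a small sketch; the two example distributions are invented):

```python
import numpy as np

def kl(q, p):
    """KL(q||p) = sum_y q(y) ln(q(y)/p(y)) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                     # 0 * ln 0 = 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [0.5, 0.5]
p = [0.9, 0.1]
print(kl(q, q))              # 0.0: KL(q||p) = 0 iff q = p
print(kl(q, p) >= 0)         # True: non-negativity
print(kl(q, p), kl(p, q))    # different values: not symmetric
```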
EM Algorithm - E Step
For fixed θ^old, maximise the lower bound L(q, θ^old) w.r.t. q(Z): the maximum is at q(Z) = p(Z | X, θ^old), where
KL(q ‖ p) = 0  and  L(q, θ^old) = ln p(X | θ^old).
EM Algorithm - M Step
For fixed q(Z), maximise L(q, θ) w.r.t. θ to obtain θ^new; since KL(q ‖ p) ≥ 0 again afterwards,
ln p(X | θ^new) ≥ L(q, θ^new) ≥ L(q, θ^old) = ln p(X | θ^old),
so the log likelihood never decreases.
[Figure: the decomposition ln p(X | θ) = L(q, θ) + KL(q ‖ p) before and after the E and M steps.]
[Figure: ln p(X | θ) and the bound L(q, θ) as functions of θ; the M-step moves from θ^old to θ^new.]
Part VII
Sampling
Sampling from the Uniform Distribution
Sampling from Standard Distributions
Rejection Sampling
Importance Sampling
Sampling from the Uniform Distribution
Linear congruential generator, for example
z^(n+1) = (2^16 + 3) z^(n) mod 2^31
Plotting successive triples (z^(n), z^(n+1), z^(n+2))^T in 3D and changing the viewpoint results in 15 planes.
[Figure: 3D scatter of successive triples before and after rotating the viewpoint.]
The reason:
z^(n+2) = (2^16 + 3)^2 z^(n) mod 2^31 = ( 6 (2^16 + 3) - 9 ) z^(n) mod 2^31 = 6 z^(n+1) - 9 z^(n) mod 2^31
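The lattice identity is easy to verify numerically (a small sketch; the seed and sequence length are arbitrary):

```python
def lcg_step(z, a=2**16 + 3, m=2**31):
    """One step of the linear congruential generator z <- a*z mod m."""
    return (a * z) % m

z = [1]
for _ in range(10):
    z.append(lcg_step(z[-1]))

# Since (2^16+3)^2 = 6*(2^16+3) - 9 (mod 2^31), successive triples satisfy
# z_{n+2} = 6 z_{n+1} - 9 z_n (mod 2^31), which confines them to few planes.
ok = all((z[n + 2] - 6 * z[n + 1] + 9 * z[n]) % 2**31 == 0 for n in range(9))
print(ok)  # True
```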
Sampling from Standard Distributions
Transform uniform samples z with the cumulative distribution function
h(y) = ∫_{-∞}^y p(x) dx
via
y = h^{-1}(z)
to obtain samples y distributed according to p(y).
[Figure: the density p(y) and its cumulative distribution h(y); a uniform sample on the vertical axis maps through h^{-1} to a sample from p(y).]
Example: the exponential distribution
p(y) = λ e^{-λy} for 0 ≤ y, and p(y) = 0 for y < 0.
h(y) = ∫_0^y λ e^{-λx} dx = 1 - e^{-λy}
so z = h(y) is inverted by y = -λ^{-1} ln(1 - z).
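The exponential example takes three lines in practice (the rate λ = 2 and sample count are invented for illustration):

```python
import numpy as np

lam = 2.0
rng = np.random.default_rng(0)
z = rng.uniform(size=100_000)        # z ~ U(0, 1)
y = -np.log(1.0 - z) / lam           # y = h^{-1}(z) = -ln(1 - z) / lambda
print(y.mean())                      # close to 1/lambda = 0.5
```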
Box-Muller: draw (z_1, z_2) uniformly inside the unit disc and set r^2 = z_1^2 + z_2^2,
y_1 = z_1 ( -2 ln r^2 / r^2 )^{1/2},   y_2 = z_2 ( -2 ln r^2 / r^2 )^{1/2}.
The change of variables gives
p(y_1, y_2) = p(z_1, z_2) | ∂(z_1, z_2) / ∂(y_1, y_2) | = [ (1/√(2π)) e^{-y_1^2/2} ] [ (1/√(2π)) e^{-y_2^2/2} ],
so y_1 and y_2 are independent standard Gaussians.
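A direct sketch of this polar variant (the sample count and seed are arbitrary choices):

```python
import numpy as np

def box_muller_polar(n, seed=0):
    """Polar Box-Muller: keep (z1, z2) uniform in the unit disc,
    return y = z * sqrt(-2 ln(r^2) / r^2), a standard Gaussian sample."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        z1, z2 = rng.uniform(-1, 1, 2)
        r2 = z1 * z1 + z2 * z2
        if 0 < r2 <= 1:                      # reject points outside the disc
            f = np.sqrt(-2.0 * np.log(r2) / r2)
            out += [z1 * f, z2 * f]
    return np.array(out[:n])

y = box_muller_polar(50_000)
print(y.mean(), y.std())   # close to 0 and 1
```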
Rejection Sampling
Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p(z) up to some unknown normalisation constant Z_p,
p(z) = (1/Z_p) p̃(z).
Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k and all z,
k q(z) ≥ p̃(z).
[Figure: the scaled proposal k q(z) envelopes p̃(z); a sample z_0 from q is paired with u_0 uniform on [0, k q(z_0)].]
Rejection Sampling
1. Generate a sample z_0 from q(z).
2. Generate u_0 from the uniform distribution over [0, k q(z_0)].
3. If u_0 > p̃(z_0), reject the pair (z_0, u_0); otherwise keep z_0.
4. The accepted z_0 are distributed according to p(z).
[Figure: accepted samples fall under p̃(z), rejected ones between p̃(z) and k q(z).]
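The four steps translate directly (a toy sketch; the unnormalised triangular target, uniform proposal, and envelope constant k = 2 are invented for illustration):

```python
import numpy as np

def rejection_sample(p_tilde, q_sample, q_pdf, k, n, seed=0):
    """Rejection sampling: accept z0 ~ q if u0 ~ U(0, k q(z0)) falls
    under p_tilde(z0). Requires k q(z) >= p_tilde(z) for all z."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        z0 = q_sample(rng)                       # step 1
        u0 = rng.uniform(0.0, k * q_pdf(z0))     # step 2
        if u0 <= p_tilde(z0):                    # step 3: keep or reject
            out.append(z0)
    return np.array(out)

# Target: unnormalised triangular density 2z on [0, 1]; proposal: U(0, 1)
p_tilde = lambda z: 2.0 * z if 0.0 <= z <= 1.0 else 0.0
samples = rejection_sample(p_tilde, lambda rng: rng.uniform(),
                           lambda z: 1.0, k=2.0, n=20_000)
print(samples.mean())   # close to the triangular mean 2/3
```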
Importance Sampling
Provides a framework to directly approximate the expectation E[f(z)] with respect to some distribution p(z).
Does NOT provide samples from p(z).
Again use a proposal distribution q(z) and draw samples z^(l) from it. Then
E[f] = ∫ f(z) p(z) dz = ∫ f(z) ( p(z) / q(z) ) q(z) dz ≈ (1/L) Σ_{l=1}^L ( p(z^(l)) / q(z^(l)) ) f(z^(l))
[Figure: samples from q(z), with p(z) and f(z) evaluated at each sample.]
If both are known only up to normalisation, p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q, then
E[f] ≈ (Z_q / Z_p) (1/L) Σ_{l=1}^L r_l f(z^(l)),  with r_l = p̃(z^(l)) / q̃(z^(l)).
The ratio of the normalisation constants can be estimated from the same samples, Z_p / Z_q ≈ (1/L) Σ_{l=1}^L r_l, giving
E[f] ≈ Σ_{l=1}^L w_l f(z^(l)),  with w_l = r_l / Σ_m r_m
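A sketch of the self-normalised estimator (the unnormalised Gaussian target, the wider Gaussian proposal, and the test function f(z) = z^2 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda z: np.exp(-0.5 * z**2)          # unnormalised N(0, 1) target
q_tilde = lambda z: np.exp(-0.5 * (z / 2.0)**2)  # unnormalised N(0, 4) proposal

L = 100_000
z = rng.normal(0.0, 2.0, L)      # samples z^(l) from the proposal
r = p_tilde(z) / q_tilde(z)      # ratios r_l
w = r / r.sum()                  # self-normalised weights w_l

f = lambda z: z**2               # E_p[z^2] = 1 under N(0, 1)
est = np.sum(w * f(z))
print(est)                       # close to 1
```

Taking the proposal wider than the target keeps the weights bounded, which is what makes this estimator well behaved.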
Markov Chain Monte Carlo - The Idea
Metropolis Algorithm
Propose z* from a symmetric distribution q(z | z^(τ)) and accept it with probability min( 1, p̃(z*) / p̃(z^(τ)) ):
z^(τ+1) = z* if accepted, z^(τ) if rejected.
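The whole algorithm is a short loop (a toy sketch; the Gaussian random-walk proposal, step size, and unnormalised Gaussian target are invented for illustration):

```python
import numpy as np

def metropolis(p_tilde, z0, n, step=1.0, seed=0):
    """Metropolis algorithm with a symmetric Gaussian random-walk proposal:
    accept z* with probability min(1, p_tilde(z*) / p_tilde(z))."""
    rng = np.random.default_rng(seed)
    z, chain = z0, []
    for _ in range(n):
        z_star = z + rng.normal(0.0, step)
        if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star                      # accepted
        chain.append(z)                     # on rejection, keep current state
    return np.array(chain)

# Unnormalised standard Gaussian target
chain = metropolis(lambda z: np.exp(-0.5 * z**2), z0=0.0, n=100_000)
print(chain.mean(), chain.std())   # close to 0 and 1
```

Note that rejected steps still contribute a copy of the current state, unlike rejection sampling where rejected proposals are discarded.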
[Figure: Metropolis sampling steps on a two-dimensional Gaussian; accepted and rejected proposals.]
Metropolis-Hastings: for an asymmetric proposal q_k, accept with probability
A_k(z*, z^(τ)) = min( 1, ( p̃(z*) q_k(z^(τ) | z*) ) / ( p̃(z^(τ)) q_k(z* | z^(τ)) ) )
Part VIII
More Machine Learning
Part IX
Resources
Journals
Books
Datasets
Journals
Conferences
Books
Christopher M. Bishop
Books
Pattern Classification
Introduction to Machine Learning, Ethem Alpaydin
Datasets
UCI Repository
http://archive.ics.uci.edu/ml/
UCI Knowledge Discovery Database Archive
http://kdd.ics.uci.edu/summary.data.application.html
Statlib
http://lib.stat.cmu.edu/
Delve
http://www.cs.utoronto.ca/~delve/
Time Series Database
http://robjhyndman.com/TSDL