
Supervised Learning: Linear Models
P J Narayanan
IIIT Hyderabad
Outline
• Model fitting: Prediction of y from x
  – Linear regression: Model is linear
  – Gradient descent method to find the best-fitting line

• Logistic regression: when y is binary
  – Function model is different
  – Gradient descent works
  – Used for classification. Gives probabilities

• Linear classifiers: Find a line to separate classes
  – Linear separability


The machine learning framework

y = f(x)
(y: output; f: prediction function; x: feature or representation)

• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error.
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x).
Slide credit: L. Lazebnik



Fitting Functions to Data
Why?
• Discover (hidden) structure in the data, given samples

• A functional form is a compact representation usable for interpolation and extrapolation

• Forms: Lines, Polynomials, Gaussians, etc.

[Figure: samples scattered in the X-Y plane with a fitted curve y = f(x)]

Called regression in general. Scalar- or vector-valued x, y.
Multivariate when x is a vector.



Linear Model
• Linear when f is a line. In general, a hyperplane
  – y = a x + b
• a, b: parameters of the model
  – y = w^T x = w·x when multivariate
• x = [1 x1 x2 … xd]^T
• The vector w represents the parameters of the model
  – Line: only 2 parameters in 2D; d parameters in a d-dimensional space

y = f(x) = w^T x

[Figure: a hyperplane, from Wolfram]


Notations

x = [1, x1, x2, …, xd]^T        w = [w0, w1, w2, …, wd]^T

w·x = w^T x = w0·1 + w1 x1 + w2 x2 + ⋯ + wd xd
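A minimal numpy sketch of this notation (the function name and example values are illustrative, not from the slides): the constant 1 is prepended to the features so the bias w0 is folded into a single dot product.

```python
import numpy as np

# Sketch: the bias term is absorbed by prepending a constant 1 to the feature
# vector, so f_w(x) = w.x = w0*1 + w1*x1 + ... + wd*xd.
def predict(w, x):
    """Linear prediction with an augmented feature vector x = [1, x1, ..., xd]."""
    x_aug = np.concatenate(([1.0], x))
    return np.dot(w, x_aug)

# Example with d = 2, w = [w0, w1, w2]
w = np.array([0.5, 2.0, -1.0])
print(predict(w, np.array([3.0, 4.0])))  # 0.5 + 2*3 - 1*4 = 2.5
```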


The Problem
• Find w given examples (x^i, y^i), i = 1, 2, …, m

• Supervised situation: the output label is available for a number m of training input samples

• Objective: predict y values for input values x not seen before
  – Called generalization in Machine Learning

• How do we find w? Gradient descent!


Typical Scenario
• Data points of x plotted against y appear scattered

• Ideal: all points lie on the line if the model is truly linear
  – Measurement errors and “noise” create deviations

• Several ways to find the best line
  – Analytical, Least Squared Error, Gradient Descent

[Figure: scattered (x, y) points around a fitted line]



Larger Issues
• What’s great about linear?
  – Simple

• But the world is not linear

• But many problems can be converted to linear ones!
  – Circular to linear (see the sketch below)
  – ExOR to linear
  – Pendulum: T² against L is linear

[Figure: a point (x, y) and its polar coordinates (r, θ)]
x = r cos θ, y = r sin θ
A circle is r = k in the r-θ space; a r + b θ + c = 0 is a weird shape!
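A small illustration of the circular-to-linear idea (the values below are made up): points on a circle of radius k become the constant relation r = k after the polar transform.

```python
import numpy as np

# Points on the circle x^2 + y^2 = k^2 map to the horizontal line r = k
# in the (r, theta) space.
k = 2.0
theta = np.linspace(0.0, 2 * np.pi, 8)
x, y = k * np.cos(theta), k * np.sin(theta)   # points on the circle

r = np.sqrt(x**2 + y**2)                      # transformed feature
print(np.allclose(r, k))                      # True: a constant (linear) relation in (r, theta)
```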



Questions?


Iterative Procedure
• Start with a guess for the model parameters w
  – Adjust till it fits well

• Prediction for a given x: f_w(x) = w^T x

• Consider the j-th training sample (x^j, y^j)
  – Predicted value: f_w(x^j) = w^T x^j
  – Observed value: y^j

• Typically not equal! y^j ≠ f_w(x^j)
  The difference guides the change in w



Loss Function
• Error or loss D(·): how far is the prediction from the observed value?
  D(f_w(x^j), y^j)
  – Different loss functions are used

• Total loss over the training samples: Σ_j D(f_w(x^j), y^j)

• Strategy: bring the predicted value closer to the observed value by adjusting w

• How do we adjust w? Gradient descent



Gradient Descent
• The minimum of the function lies in the direction opposite to the gradient
  1. Positive gradient: the function will increase if we go forward
  2. Negative gradient: the minimum lies ahead

• Take a step against the gradient:
  w′ = w − η ∇J(w)

[Figure: a curve J(w), with points 1 and 2 on either side of the minimum]


Fine Points

When do we stop the iterations?
• When the gradient value is too low (< ε)
  – Future changes will be low!
• When the change in the objective function is too small
  – We are close already

How about the step size?
• Ensure smooth convergence
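A minimal sketch of a gradient-descent loop with the two stopping tests above (the function names, tolerances, and the toy objective are illustrative assumptions, not from the slides):

```python
import numpy as np

# Stop when the gradient is tiny, or when the objective barely changes.
def gradient_descent(grad_J, J, w0, eta=0.1, eps=1e-6, max_iters=10_000):
    w = np.asarray(w0, dtype=float)
    prev = J(w)
    for _ in range(max_iters):
        g = grad_J(w)
        if np.linalg.norm(g) < eps:          # gradient too low: future changes will be low
            break
        w = w - eta * g                      # step against the gradient
        curr = J(w)
        if abs(prev - curr) < eps:           # objective barely changed: we are close already
            break
        prev = curr
    return w

# Example: minimize J(w) = ||w - 3||^2 / 2, whose gradient is (w - 3).
w_star = gradient_descent(lambda w: w - 3.0,
                          lambda w: 0.5 * np.sum((w - 3.0) ** 2),
                          w0=[0.0])
print(w_star)   # close to [3.]
```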


Gradient Descent
• Converges to the minimum when the function is convex
  – Converges to a local minimum otherwise

• The learning rate is critical
  – Start high to make rapid strides
  – Reduce it with time for smoother convergence

[Figure: J plotted against w]


Questions?


Minimize Loss Function
• A loss function: L2 or Euclidean distance
  J(w) = ½ ‖f_w(x^j) − y^j‖² = ½ (f_w(x^j) − y^j)²

• J(w) is the function to be minimized with respect to w

• Cost due to a single sample, and cost for all of them:
  J(w) = ½ Σ_j (f_w(x^j) − y^j)²,  j = 1, …, m
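A short numpy sketch of this cost (the data and names are illustrative; X carries a leading column of ones for the bias term):

```python
import numpy as np

def cost(w, X, y):
    """J(w) = 1/2 * sum_j (f_w(x^j) - y^j)^2 for the linear model f_w(x) = w.x"""
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # three samples, x = [1, x1]
y = np.array([1.0, 3.0, 5.0])                       # generated by y = 2*x1 + 1
print(cost(np.array([1.0, 2.0]), X, y))             # 0.0: a perfect fit
```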


LMS Update Rule
• Gradient descent follows this equation: w′ = w − η ∇J(w)

• What is the gradient of J(w)?
  ∇_w ½ (f_w(x^j) − y^j)² = (f_w(x^j) − y^j) x^j

• This is a vector of d dimensions, like x and w


Batch and Stochastic GD
• Batch GD: go through all input samples and update at the end
  w′ = w − η Σ_j (f_w(x^j) − y^j) x^j
  – Uses the “true” gradient
  – Computationally expensive

• Stochastic GD: update the weights after each sample
  w′ = w − η (f_w(x^j) − y^j) x^j
  – More “noisy”, but faster
  – Noise may actually help!

• Mini-batch: update after a small number of samples
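A sketch of the two update rules above for linear regression (the function names, learning rates, and toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def batch_gd(X, y, eta=0.02, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)              # sum_j (f_w(x^j) - y^j) x^j
        w -= eta * grad                       # one update per pass over the data
    return w

def stochastic_gd(X, y, eta=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(y)):     # one update per sample, in random order
            w -= eta * (X[j] @ w - y[j]) * X[j]
    return w

X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])  # x = [1, x1]
y = 1.0 + 2.0 * X[:, 1]                                    # y = 2*x1 + 1, no noise
print(batch_gd(X, y), stochastic_gd(X, y))                 # both approach [1, 2]
```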



Summary
• Linear regression fits a line or a hyperplane to a set of points
  – Models the behaviour of the (continuous) dependent variable against the independent one

• Several methods exist to perform line fitting
  – Iterative methods work well in several situations
  – Machine learning uses lots of data; incremental methods are more suitable

• Gradient Descent is a versatile method!


Questions?


Categorical/Binary Functions
• When the y value is a category label (with no clear relative ordering)

• Easy case: y is binary. Yes/No. True/False. 1/0.

• Can be interpreted as the probability of an outcome being true, given an event


Logistic Function
• The logistic function comes in handy:
  f(z) = 1 / (1 + e^(−z))

• Goes from 0 to 1 as z goes from −∞ to +∞

• Can be interpreted as a probability
  P(success) = f(effort)

• Use kz to make the transition steeper; (z − z0) to shift the transition point

[Figure: the logistic curve, from Wikipedia]
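A tiny sketch of the logistic function with the steepness k and shift z0 mentioned above (the function name and values are illustrative):

```python
import numpy as np

def logistic(z, k=1.0, z0=0.0):
    """f(z) = 1 / (1 + exp(-k (z - z0))): k sets the steepness, z0 the transition point."""
    return 1.0 / (1.0 + np.exp(-k * (z - z0)))

print(logistic(np.array([-10.0, 0.0, 10.0])))   # approximately [0, 0.5, 1]
print(logistic(0.5, k=10.0, z0=0.5))            # 0.5: the transition point has moved to z0
```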



Logistic Function
• The logistic function: f(z) = 1 / (1 + e^(−z))
  – Probability of an outcome

• Has an interesting derivative form: f′(z) = f(z)(1 − f(z))

• Connect with the linear model: z = w^T x
  f_w(x) = 1 / (1 + e^(−w^T x))

• A Generalized Linear Model with parameters w



Logistic Regression
• Fit a logistic function to the outcome y with respect to x

• Gradient Descent can be used to minimize the loss function
  – The maths is involved and uses the Maximum Likelihood estimate, etc.

• Results in the exact same update rule as linear regression, though f is non-linear:
  w′ = w − η (f_w(x^j) − y^j) x^j
  – Can use Batch or Stochastic Gradient Descent methods
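A minimal stochastic-gradient sketch of logistic regression using this update rule (the data, learning rate, and names are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(y)):
            f = sigmoid(X[j] @ w)            # f_w(x^j): predicted probability of class 1
            w -= eta * (f - y[j]) * X[j]     # same form as the linear-regression update
    return w

# Toy 1-D data: class 1 when x1 > 1; X carries a leading 1 for the bias.
X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, 0.0, 2.0, 3.0, 4.0])])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic(X, y)
print(sigmoid(X @ w) > 0.5)                  # [False False False  True  True  True]
```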
Binary Classification
• Logistic regression gives the probability of an outcome

• Convert it to a classifier with output True or False

• Classification rule: if f(w^T x) > 0.5, output True; else False

• Can be extended to multiple classes also
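A small sketch of the rule (the weights and points are made up; note that f(w^T x) > 0.5 is the same condition as w^T x > 0):

```python
import numpy as np

def classify(w, x):
    """True if the predicted probability f(w.x) exceeds 0.5, i.e. if w.x > 0."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return p > 0.5

w = np.array([-1.0, 1.0])                 # illustrative weights, x = [1, x1]
print(classify(w, np.array([1.0, 3.0])))  # True  (w.x = 2 > 0)
print(classify(w, np.array([1.0, 0.5])))  # False (w.x = -0.5 < 0)
```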


Summary
• The logistic function can map inputs to the probability of a categorical output
  – Can be used as a classifier for Yes/No questions

• Another form of f(x). Another form of loss function

• Gradient Descent works well for this also. All one needs is a gradient

• Differentiability of the loss function is important!



Questions?


Linear Classifier
• Logistic regression fits a function to the outcome y of the independent variable x

• Classification can instead be done by partitioning the space among the classes

• Linear classifiers have linear partitions, or decision boundaries

[Figure: two classes of points in the X-Y plane separated by a line]
Decision Boundary
• Decision boundary: the hyperplane w^T x = 0

• Class 1 lies on the positive side and Class 0 on the negative side
  – t = 1 for Class 1 and t = −1 for Class 0
  – Negate the features of Class 0 by using (t x)

• For each training sample x^j, the signed distance to the line: w^T(t x^j) ≥ 0

• Loss function: J(w) = ½ Σ_j (w^T(t x^j))²
  Gradient: ∇J(w) = Σ_j [w^T(t x^j)] (t x^j)

• Gradient Descent can work!
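A small illustrative check of the sign-flip trick (the weights and points are hand-picked here, not learned; in the slides w would come from gradient descent): a sample is predicted as Class 1 when w^T x > 0, and it sits on the correct side exactly when w^T(t x) ≥ 0.

```python
import numpy as np

w = np.array([-1.0, 1.0, 1.0])                     # boundary x1 + x2 = 1, with x = [1, x1, x2]
X = np.array([[1, 2.0, 1.0],                       # two Class 1 samples
              [1, 1.0, 1.5],
              [1, 0.0, 0.5],                       # two Class 0 samples
              [1, 0.5, 0.0]])
t = np.array([1, 1, -1, -1])                       # +1 for Class 1, -1 for Class 0

pred_class1 = X @ w > 0                            # classification rule: sign of w.x
print(pred_class1)                                 # [ True  True False False]
print((X * t[:, None]) @ w >= 0)                   # all True: every sample correctly placed
```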


Boundary with Margin
• Several lines separate the classes when there is a large gap between them

• Need to find the middle line with maximum distance to the points of each class

• Require w^T(t x^j) ≥ b, where b is a margin

Will see this later!



What’s truly important?
• Every sample contributes by w^T x
  – Susceptible to samples that are far from the boundary
  – Called outliers

• Samples close to the boundary alone should matter in finding the boundary!
  – Will see it in SVM


Linear Separability
• Sometimes no line can separate the classes cleanly
  – This is called the ExOR problem

[Figure: the ExOR configuration of points in the X-Y plane]

• Some cases can be transformed to a linearly separable one
  – Map x to ϕ(x)
  – Separable in the ϕ(x) space (see the sketch below)
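A small sketch of such a transform for ExOR (the map ϕ and the weights below are illustrative choices, not from the slides): adding the product x1·x2 as a feature makes the four ExOR points linearly separable.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                            # ExOR labels

def phi(x):
    return np.array([1.0, x[0], x[1], x[0] * x[1]])   # [1, x1, x2, x1*x2]

w = np.array([-0.5, 1.0, 1.0, -2.0])                  # a separating hyperplane in phi-space
pred = np.array([phi(x) @ w > 0 for x in X])
print(pred.astype(int))                               # [0 1 1 0]: matches the ExOR labels
```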


Summary & Questions
• Linear methods are simple and versatile
  – Several situations can be mapped to linear ones

• Other advanced methods are variations or extensions of simple linear methods
  – Support Vector Machines
  – Neural Networks, including Deep ones


Thank You!
