
Linear Regression

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Univariate regression
Multivariate regression
Probabilistic view of regression
Loss functions
Bias-Variance analysis
Regularization


Example - Green Chilies Entertainment Company

[Scatter plot: earnings from the film (in crores of Rs) vs. cost of making the film (in crores of Rs)]



Notations
Training dataset
Number of examples - $N$
Input variable - $x_i$
Target variable - $y_i$

Goal: learn a function $h(x)$ that predicts $y$ for a new input $x$


Cost of Film (Crores of Rs) x    Profit/Loss (Crores of Rs) y
98.28                            199.69
40.22                             93.69
62.07                            100.33


Linear Regression
Simplest form

$h(x) = \theta_0 + \theta_1 x$

[Scatter plot: earnings from the film vs. cost of making the film, with the fitted line]



Least Mean Squares - Cost Function
Choose parameters $\theta_0$ and $\theta_1$ (or $\mathbf{w}$) so that $h(x)$ is as close as possible to $y$

[Scatter plot: earnings from the film vs. cost of making the film]



Least Mean Squares - Cost Function - Parameter Space (1)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Least Mean Squares - Cost Function - Parameter Space (2)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Least Mean Squares - Cost Function - Parameter Space (3)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Plot of the Error Surface


Contour Plot of Error Surface


Estimating Optimal Parameters


Gradient Descent - Basic Principle
Minimize
$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$

Start with an initial estimate for w


Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
Stop when there is no change or the minimum has been reached


Gradient Descent - Intuition


Effect of Learning Parameter


Too small a value: slow convergence
Too large a value: oscillates widely and may not converge


Gradient Descent - Local Minima
Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima


Gradient Descent for Regression


Convex error function
$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$
Geometrically, the error surface is bowl shaped - there is only a global minimum
Exercise: prove that the sum of squared errors is a convex function

Parameter Update (1)


Minimize

$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$


Parameter Update (2)


Repeat till convergence

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)$$
$$\theta_1 := \theta_1 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)\,x_i$$
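The update rules above translate directly into a few lines of NumPy. The following is a minimal sketch; the learning rate, iteration count, and the reuse of the three film examples from the earlier table are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent_univariate(x, y, alpha=1e-4, n_iters=20000):
    """Fit h(x) = theta0 + theta1 * x by repeating the two updates above."""
    N = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        residual = (theta0 + theta1 * x) - y           # h(x_i) - y_i for every i
        theta0 -= alpha * residual.sum() / N           # intercept update
        theta1 -= alpha * (residual * x).sum() / N     # slope update
    return theta0, theta1

# Film cost (x) and profit/loss (y) from the earlier table, in crores of Rs.
x = np.array([98.28, 40.22, 62.07])
y = np.array([199.69, 93.69, 100.33])
print(gradient_descent_univariate(x, y))
```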


Example - Iteration 0
[Plots: current regression function fit and the error function]

Example - Iteration 1
[Plots: current regression function fit and the error function]

Example - Iteration 2
[Plots: current regression function fit and the error function]

Example - Iteration 4
[Plots: current regression function fit and the error function]

Example - Iteration 7
[Plots: current regression function fit and the error function]

Example - Iteration 9
[Plots: current regression function fit and the error function]

Gradient Descent
Batch Mode
Update includes contribution of all data points
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)$$
$$\theta_1 := \theta_1 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)\,x_i$$

Will talk about stochastic gradient descent later (neural networks).

Multivariate Linear Regression


Cost of Film     Celebrity status     # of theatres    Age of the     Earnings
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Dimension of the input data - $d$


Multivariate Linear Regression Formulation


Simplest model:
$$h(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d$$
Parameters to learn: $\theta = (\theta_0, \theta_1, \dots, \theta_d)$
Cost function:
$$J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2$$
Update equation:
$$\theta_j := \theta_j - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)\,x_{ij}$$


Gradient Descent
Parameter update equation
$$\theta_j := \theta_j - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)\,x_{ij}$$
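In vectorized form the per-coordinate update above becomes a single matrix-vector expression. A sketch, assuming the design matrix X already carries a leading column of ones (alpha and n_iters are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """Batch gradient descent for h(x) = X @ theta."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ theta - y                 # h(x_i) - y_i for all i
        theta -= alpha * (X.T @ residual) / N    # updates every theta_j at once
    return theta
```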


Feature Scaling - Multivariate Linear Regression (1)
Cost of Film     Celebrity status     # of theatres    Age of the     Profit/Loss
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Transform the features so that they are on the same scale



Feature Scaling for Multivariate Linear Regression (2)
Normalization: rescale each feature to lie in $[-1, 1]$ or $[0, 1]$
Standardization: rescale each feature to mean 0 and standard deviation 1
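Both transformations are one-liners in NumPy; the sketch below applies them column-wise to the film features from the table (purely illustrative):

```python
import numpy as np

def normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Columns: cost, celebrity status, # of theatres, age of the protagonist.
X = np.array([[75.72, 7.57, 32, 52],
              [18.74, 1.87, 16, 68],
              [50.96, 5.09, 27, 35]])
print(standardize(X).mean(axis=0))   # approximately 0 for every column
```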


Multivariate Linear Regression - Analytical Solution
Cost of Film     Celebrity status     # of theatres    Age of the     Profit/Loss
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Design Matrix and Target Vector
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

Least Squares Method


With the design matrix $X$ and target vector $\mathbf{y}$ defined above,
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 = \frac{1}{2}\,(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$


Normal Equations
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$
Find the gradient with respect to $\theta$ and equate it to 0:
$$\nabla_\theta J(\theta) = X^\top (X\theta - \mathbf{y}) = 0 \;\Longrightarrow\; \theta = (X^\top X)^{-1} X^\top \mathbf{y}$$


Analytical Solution
Advantage
No need for the learning rate parameter
No need for iterative updates

Disadvantage
Need to perform matrix inversion

Pseudo-inverse of the matrix: $(X^\top X)^{-1} X^\top$
Sometimes we deal with non-invertible matrices (redundant features)
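A sketch of the analytical solution in NumPy. Using the pseudo-inverse (or a least-squares solver) rather than an explicit inverse also copes with the non-invertible, redundant-feature case mentioned above:

```python
import numpy as np

def fit_normal_equations(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def fit_lstsq(X, y):
    """Numerically preferable: solve the least-squares problem directly."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```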


Probabilistic View of Linear Regression (1)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution


Probabilistic View of Linear Regression (2)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution - why?
$\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$
$3\sigma$ rule


Probabilistic View of Linear Regression (3)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution

Then
$$p(y \mid \mathbf{x}; \theta) = \mathcal{N}\big(h(\mathbf{x}), \sigma^2\big)$$
And
$$\mathbb{E}[y \mid \mathbf{x}] = h(\mathbf{x})$$

[Figure 1.28 from Bishop, PRML: The regression function y(x), which minimizes the expected squared loss, is given by the mean of the conditional distribution p(t|x).]

Probabilistic View of Linear Regression (4)
Likelihood of the data (assuming independent examples):
$$L(\theta) = p(y_1, \dots, y_N \mid \mathbf{x}_1, \dots, \mathbf{x}_N; \theta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \theta)$$


Maximizing the Likelihood


Maximize
$$L(\theta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \theta)$$
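Taking logs turns the product into a sum; a sketch of the standard derivation under the Gaussian noise model above:
$$\log L(\theta) = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big(-\frac{(y_i - h(\mathbf{x}_i))^2}{2\sigma^2}\Big) = N\log\frac{1}{\sqrt{2\pi}\,\sigma} \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - h(\mathbf{x}_i)\big)^2$$
Maximizing the likelihood is therefore equivalent to minimizing the sum of squared errors - the least mean squares cost function.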


Loss Functions
Squared loss: $\big(y - h(\mathbf{x})\big)^2$
Absolute loss: $\lvert y - h(\mathbf{x})\rvert$
Dead-band loss: $\max\big(0, \lvert y - h(\mathbf{x})\rvert - \epsilon\big)$
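The three losses are easy to compare side by side on a residual r = y - h(x); a small sketch (the dead-band width eps is an assumed illustrative value):

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def dead_band_loss(r, eps=1.0):
    """Zero inside the band |r| <= eps, grows linearly outside it."""
    return np.maximum(0.0, np.abs(r) - eps)

r = np.linspace(-3, 3, 7)
print(squared_loss(r), absolute_loss(r), dead_band_loss(r), sep="\n")
```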


Loss Functions
Problem with squared loss: large errors (e.g., outliers) dominate the fit


Linear Regression with Absolute Loss Function
Objective
$$\min_\theta \sum_{i=1}^{N} \big\lvert \theta^\top \mathbf{x}_i - y_i \big\rvert$$
Non-differentiable, so cannot take the gradient descent approach
Solution: frame it as a constrained optimization problem
Introduce new variables $a_1, \dots, a_N$ with $a_i \ge \lvert \theta^\top \mathbf{x}_i - y_i \rvert$
$$\min_{\theta, \mathbf{a}} \sum_{i=1}^{N} a_i, \quad \text{subject to} \quad -a_i \le \theta^\top \mathbf{x}_i - y_i \le a_i$$
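The constrained formulation above is a linear program, so an off-the-shelf LP solver can be used. A sketch with SciPy, stacking the decision variables as [theta, a]; the variable layout and the reuse of the film data are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    """min sum(a_i)  subject to  -a_i <= theta^T x_i - y_i <= a_i."""
    N, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(N)])   # objective: sum of the a_i
    I = np.eye(N)
    A_ub = np.block([[X, -I],                       #  X theta - a <= y
                     [-X, -I]])                     # -X theta - a <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * N   # theta free, a_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]                                # the fitted theta

X = np.array([[1.0, 98.28], [1.0, 40.22], [1.0, 62.07]])   # bias column + cost
y = np.array([199.69, 93.69, 100.33])
print(fit_absolute_loss(X, y))
```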


Linear Regression with Absolute Loss Function - Example
[Plots comparing the LMS (squared loss) output and the LP (absolute loss) output]


Some Additional Notations


Underlying response function (target concept): $f(\mathbf{x})$
Actual observed response: $y = f(\mathbf{x}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $\mathbb{E}[y \mid \mathbf{x}] = f(\mathbf{x})$
Predicted response based on the model learned from dataset $D$: $h(\mathbf{x}; D)$
Expected response averaged over all datasets: $\bar{h}(\mathbf{x}) = \mathbb{E}_D\big[h(\mathbf{x}; D)\big]$
Expected squared error on a new test instance $\mathbf{x}$: $E_{\mathrm{err}} = \mathbb{E}_D\big[\big(h(\mathbf{x}; D) - y\big)^2\big]$


Bias-Variance Analysis (1)


Bias-Variance Analysis (2)


Bias-Variance Analysis (3)


Root Mean Square Error


Bias-Variance Analysis (4)


9th degree polynomial fit with more sample data


Bias-Variance Analysis (5)


Expected squared loss:
$$\mathbb{E}[L] = \int\!\!\int \big(h(\mathbf{x}) - y\big)^2\, p(\mathbf{x}, y)\, \mathrm{d}\mathbf{x}\, \mathrm{d}y$$


Bias-Variance Analysis (6)


Expected squared loss:
$$\mathbb{E}[L] = \int\!\!\int \big(h(\mathbf{x}) - y\big)^2\, p(\mathbf{x}, y)\, \mathrm{d}\mathbf{x}\, \mathrm{d}y$$


Bias-Variance Analysis (7)


Relevant part of the loss:
$$\int \big(h(\mathbf{x}) - f(\mathbf{x})\big)^2\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}$$


Bias-Variance Analysis (8)


Relevant part of the loss:
$$\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - f(\mathbf{x})\big)^2\big]$$
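Adding and subtracting $\mathbb{E}_D[h(\mathbf{x}; D)]$ inside the square gives the standard decomposition used in the next slides (a sketch of the derivation; the cross term vanishes because the deviation of $h$ from its mean has zero expectation over datasets):
$$\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - f(\mathbf{x})\big)^2\big] = \underbrace{\big(\mathbb{E}_D[h(\mathbf{x}; D)] - f(\mathbf{x})\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - \mathbb{E}_D[h(\mathbf{x}; D)]\big)^2\big]}_{\text{variance}}$$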


Bias-Variance Analysis (9)


[Plots of fitted models for Degree = 1 and Degree = 4]


Bias-Variance Analysis (10)


Bias term of the error: $\big(\mathbb{E}_D[h(\mathbf{x}; D)] - f(\mathbf{x})\big)^2$

Measures how well our approximation architecture can fit the data
Weak approximators will have high bias
Example: low-degree polynomials
Strong approximators will have low bias
Example: high-degree polynomials


Bias-Variance Analysis (11)


Variance term of the error:
$$\mathbb{E}_D\Big[\big(h(\mathbf{x}; D) - \mathbb{E}_D[h(\mathbf{x}; D)]\big)^2\Big]$$

No direct dependence on the target value

For a fixed size dataset


Strong approximators tend to have more variance
Small changes in the dataset can result in wide changes in the
predictors

Weak approximators tend to have less variance


Small changes in the dataset result in similar predictors

Variance disappears as the dataset size $N \to \infty$


Bias-Variance Analysis (12)


Measuring Bias and Variance in practice
Bootstrap from the given dataset

Start with a complex approximator, and reduce the complexity through regularization
Setting more coefficients/parameters to 0
Doing feature selection
Reduces variance, but can increase bias
Hopefully just sufficient to model the given data
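A sketch of the bootstrap measurement mentioned above: resample the training set, refit a polynomial each time, and average. The true function, noise level, and polynomial degrees are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                  # assumed target concept
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)
x_test = np.linspace(0, 1, 100)

def bootstrap_bias_variance(degree, n_boot=200):
    preds = np.empty((n_boot, len(x_test)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x_train), len(x_train))   # bootstrap resample
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[b] = np.polyval(coeffs, x_test)
    h_bar = preds.mean(axis=0)                       # expected response h_bar(x)
    bias2 = np.mean((h_bar - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for degree in (1, 4, 9):
    print(degree, bootstrap_bias_variance(degree))
```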


Regularization
Central Idea: penalize over-complicated solutions
Linear regression minimizes
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2$$
Regularized regression minimizes
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \cdot \mathrm{penalty}(\theta)$$


Modified Solution
Solution for ordinary linear regression
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$
$$\theta = (X^\top X)^{-1} X^\top \mathbf{y}$$
Now for the regularized version which uses the $\ell_2$ norm - Ridge Regression
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y}) + \frac{\lambda}{2}\,\lVert\theta\rVert_2^2$$
$$\theta = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$
Exercise: derive the closed-form solution for ridge regression with the $\ell_2$ regularizer
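A sketch of the ridge closed form in NumPy (lam is an assumed regularization strength; in practice the bias coefficient is often left out of the penalty, which is ignored here for brevity):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """theta = (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```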

How to choose $\lambda$?
Tradeoff between complexity vs. goodness of the fit
Solution 1: If we have lots of data
Generate multiple models
Use lots of test data to discard the bad models

Solution 2: With limited data


Use k-fold cross validation
Will discuss later


General Form of the Regularizer Term
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \sum_{j=1}^{d} \lvert\theta_j\rvert^{q}$$
Quadratic / $\ell_2$ regularizer: $q = 2$
[Figure 3.3 from Bishop, PRML: contours of the regularization term for q = 0.5, q = 1, q = 2, and q = 4]

Special scenario q = 1 - LASSO
Least Absolute Shrinkage and Selection Operator
Error function:
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \sum_{j=1}^{d} \lvert\theta_j\rvert$$
For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution
[Figure 3.4 from Bishop, PRML: contours of the unregularized error function along with the constraint region for the quadratic regularizer q = 2 (left) and the lasso regularizer q = 1 (right); the optimum value of the parameter vector w is marked. The lasso gives a sparse solution in which w1 = 0.]

LASSO
Quadratic programming can be used to solve the optimization problem
Least Angle Regression solution - refer to ESL
http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB package for LASSO
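Besides the MATLAB package above, scikit-learn ships a coordinate-descent LASSO solver. A sketch on the film data (the alpha value is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[75.72, 7.57, 32, 52],
              [18.74, 1.87, 16, 68],
              [50.96, 5.09, 27, 35]], dtype=float)
y = np.array([157.39, 81.93, 131.95])

model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)        # for large enough alpha, many coefficients are exactly 0
print(model.intercept_)
```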


Linear Regression with Non-Linear Basis Functions
Linear combination of fixed non-linear functions of the input variables:
$$h(\mathbf{x}) = \theta_0 + \sum_{j=1}^{d} \theta_j\, \phi_j(\mathbf{x})$$

[Figure 3.1 from Bishop, PRML: examples of basis functions - polynomials (left), Gaussians (centre), and sigmoids (right)]

Linear Regression with Basis Functions
Solution
$$\Phi = \begin{pmatrix} 1 & \phi_1(\mathbf{x}_1) & \cdots & \phi_d(\mathbf{x}_1) \\ \vdots & & & \vdots \\ 1 & \phi_1(\mathbf{x}_N) & \cdots & \phi_d(\mathbf{x}_N) \end{pmatrix}, \qquad \theta = \big(\Phi^\top \Phi\big)^{-1} \Phi^\top \mathbf{y}$$
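A sketch of the closed-form fit with Gaussian basis functions (the centres, width s, and synthetic data are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(x, centres, s=0.2):
    """Phi with a leading bias column: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 25))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 25)

Phi = gaussian_basis(x, np.linspace(0, 1, 9))
theta = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y     # theta = (Phi^T Phi)^{-1} Phi^T y
print(theta)
```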


Linear Regression with Multiple Outputs
Multiple outputs: each target is a vector $\mathbf{y}_i \in \mathbb{R}^{K}$
Stack the targets into a matrix $Y$ (one row per example) and keep the same design matrix $X$:
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{pmatrix}$$
$$\Theta = \big(X^\top X\big)^{-1} X^\top Y, \qquad \Theta \in \mathbb{R}^{(d+1)\times K}$$


Summary
Linear Regression (aka curve fitting)
Gradient Descent Approach for finding the solution
Analytical solution
Loss Functions
Probabilistic view of Linear Regression
Bias-Variance analysis
Regularization
Ridge Regression

Regression with basis functions


Locally weighted regression (refer ML - 8.3)
