
Linear Regression

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Univariate regression
Multivariate regression
Probabilistic view of regression
Loss functions
Bias-Variance analysis
Regularization


Example - Green Chilies Entertainment Company

[Scatter plot: earnings from the film (in crores of Rs) vs. cost of making the film (in crores of Rs)]



Notations
Training dataset
Number of examples - $N$
Input variable - $x_i$
Target variable - $y_i$

Goal: learn a function $h(x)$ that predicts $y$ for a new input $x$


Cost of Film (Crores of Rs) x    Profit/Loss (Crores of Rs) y
98.28                            199.69
40.22                             93.69
62.07                            100.33


Linear Regression
Simplest form

$h(x) = \theta_0 + \theta_1 x$

[Scatter plot: earnings from the film vs. cost of making the film, with the fitted line]



Least Mean Squares - Cost Function
Choose parameters $\theta_0$ and $\theta_1$ (or $\mathbf{w}$) so that $h(x)$ is as close as possible to $y$

[Scatter plot: earnings from the film vs. cost of making the film]



Least Mean Squares - Cost Function - Parameter Space (1)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Least Mean Squares - Cost Function - Parameter Space (2)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Least Mean Squares - Cost Function - Parameter Space (3)
Let
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)^2$$

Plot of the Error Surface


Contour Plot of Error Surface


Estimating Optimal Parameters


Gradient Descent - Basic Principle
Minimize
$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$

Start with an initial estimate for w


Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
Stop when there is no change or the minimum has been reached


Gradient Descent - Intuition


Effect of Learning Parameter


Too small a value: slow convergence
Too large a value: oscillates widely and may not converge


Gradient Descent - Local Minima
Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima


Gradient Descent for Regression


Convex error function
$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$
Geometrically, the error surface is bowl shaped - there is only a global minimum
Exercise: prove that the sum of squared errors is a convex function

Parameter Update (1)


Minimize

$$J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(h_{\mathbf{w}}(x_i) - y_i\big)^2$$


Parameter Update (2)


Repeat till convergence

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)$$
$$\theta_1 := \theta_1 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)\,x_i$$
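The update rules above translate directly into a few lines of NumPy. The following is a minimal sketch; the learning rate, iteration count, and the reuse of the three film examples from the earlier table are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent_univariate(x, y, alpha=1e-4, n_iters=20000):
    """Fit h(x) = theta0 + theta1 * x by repeating the two updates above."""
    N = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        residual = (theta0 + theta1 * x) - y           # h(x_i) - y_i for every i
        theta0 -= alpha * residual.sum() / N           # intercept update
        theta1 -= alpha * (residual * x).sum() / N     # slope update
    return theta0, theta1

# Film cost (x) and profit/loss (y) from the earlier table, in crores of Rs.
x = np.array([98.28, 40.22, 62.07])
y = np.array([199.69, 93.69, 100.33])
print(gradient_descent_univariate(x, y))
```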


Example - Iteration 0
[Plots: current regression function fit and the error function]

Example - Iteration 1
[Plots: current regression function fit and the error function]

Example - Iteration 2
[Plots: current regression function fit and the error function]

Example - Iteration 4
[Plots: current regression function fit and the error function]

Example - Iteration 7
[Plots: current regression function fit and the error function]

Example - Iteration 9
[Plots: current regression function fit and the error function]

Gradient Descent
Batch Mode
Update includes contribution of all data points
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)$$
$$\theta_1 := \theta_1 - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)\,x_i$$

Will talk about stochastic gradient descent later (neural networks).

Multivariate Linear Regression


Cost of Film     Celebrity status     # of theatres    Age of the     Earnings
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Dimension of the input data - $d$


Multivariate Linear Regression Formulation


Simplest model:
$$h(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d$$
Parameters to learn: $\theta = (\theta_0, \theta_1, \dots, \theta_d)$
Cost function:
$$J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2$$
Update equation:
$$\theta_j := \theta_j - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)\,x_{ij}$$


Gradient Descent
Parameter update equation
$$\theta_j := \theta_j - \alpha\,\frac{1}{N}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)\,x_{ij}$$
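In vectorized form the per-coordinate update above becomes a single matrix-vector expression. A sketch, assuming the design matrix X already carries a leading column of ones (alpha and n_iters are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """Batch gradient descent for h(x) = X @ theta."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ theta - y                 # h(x_i) - y_i for all i
        theta -= alpha * (X.T @ residual) / N    # updates every theta_j at once
    return theta
```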


Feature Scaling - Multivariate Linear Regression (1)
Cost of Film     Celebrity status     # of theatres    Age of the     Profit/Loss
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Transform the features so that they are on the same scale



Feature Scaling for Multivariate Linear Regression (2)
Normalization: rescale each feature to lie in $[-1, 1]$ or $[0, 1]$
Standardization: rescale each feature to mean 0 and standard deviation 1
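Both transformations are one-liners in NumPy; the sketch below applies them column-wise to the film features from the table (purely illustrative):

```python
import numpy as np

def normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Columns: cost, celebrity status, # of theatres, age of the protagonist.
X = np.array([[75.72, 7.57, 32, 52],
              [18.74, 1.87, 16, 68],
              [50.96, 5.09, 27, 35]])
print(standardize(X).mean(axis=0))   # approximately 0 for every column
```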


Multivariate Linear Regression - Analytical Solution
Cost of Film     Celebrity status     # of theatres    Age of the     Profit/Loss
(Crores of Rs)   of the protagonist   release          protagonist    (Crores of Rs) y
75.72            7.57                 32               52             157.39
18.74            1.87                 16               68              81.93
50.96            5.09                 27               35             131.95

Design Matrix and Target Vector
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

Least Squares Method


With the design matrix $X$ and target vector $\mathbf{y}$ defined above,
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 = \frac{1}{2}\,(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$


Normal Equations
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$
Find the gradient with respect to $\theta$ and equate it to 0:
$$\nabla_\theta J(\theta) = X^\top (X\theta - \mathbf{y}) = 0 \;\Longrightarrow\; \theta = (X^\top X)^{-1} X^\top \mathbf{y}$$


Analytical Solution
Advantage
No need for the learning rate parameter
No need for iterative updates

Disadvantage
Need to perform matrix inversion

Pseudo-inverse of the matrix: $(X^\top X)^{-1} X^\top$
Sometimes we deal with non-invertible matrices (redundant features)
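A sketch of the analytical solution in NumPy. Using the pseudo-inverse (or a least-squares solver) rather than an explicit inverse also copes with the non-invertible, redundant-feature case mentioned above:

```python
import numpy as np

def fit_normal_equations(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def fit_lstsq(X, y):
    """Numerically preferable: solve the least-squares problem directly."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```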


Probabilistic View of Linear Regression (1)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution


Probabilistic View of Linear Regression (2)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution - why?
$\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$
$3\sigma$ rule


Probabilistic View of Linear Regression (3)
Let $y = h(\mathbf{x}) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution

Then
$$p(y \mid \mathbf{x}; \theta) = \mathcal{N}\big(h(\mathbf{x}), \sigma^2\big)$$
And
$$\mathbb{E}[y \mid \mathbf{x}] = h(\mathbf{x})$$

[Figure 1.28 from Bishop, PRML: The regression function y(x), which minimizes the expected squared loss, is given by the mean of the conditional distribution p(t|x).]

Probabilistic View of Linear Regression (4)
Likelihood of the data (assuming independent examples):
$$L(\theta) = p(y_1, \dots, y_N \mid \mathbf{x}_1, \dots, \mathbf{x}_N; \theta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \theta)$$


Maximizing the Likelihood


Maximize
$$L(\theta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \theta)$$
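Taking logs turns the product into a sum; a sketch of the standard derivation under the Gaussian noise model above:
$$\log L(\theta) = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big(-\frac{(y_i - h(\mathbf{x}_i))^2}{2\sigma^2}\Big) = N\log\frac{1}{\sqrt{2\pi}\,\sigma} \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - h(\mathbf{x}_i)\big)^2$$
Maximizing the likelihood is therefore equivalent to minimizing the sum of squared errors - the least mean squares cost function.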


Loss Functions
Squared loss: $\big(y - h(\mathbf{x})\big)^2$
Absolute loss: $\lvert y - h(\mathbf{x})\rvert$
Dead-band loss: $\max\big(0, \lvert y - h(\mathbf{x})\rvert - \epsilon\big)$
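The three losses are easy to compare side by side on a residual r = y - h(x); a small sketch (the dead-band width eps is an assumed illustrative value):

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def dead_band_loss(r, eps=1.0):
    """Zero inside the band |r| <= eps, grows linearly outside it."""
    return np.maximum(0.0, np.abs(r) - eps)

r = np.linspace(-3, 3, 7)
print(squared_loss(r), absolute_loss(r), dead_band_loss(r), sep="\n")
```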


Loss Functions
Problem with squared loss: large errors (e.g., outliers) dominate the fit


Linear Regression with Absolute Loss Function
Objective
$$\min_\theta \sum_{i=1}^{N} \big\lvert \theta^\top \mathbf{x}_i - y_i \big\rvert$$
Non-differentiable, so cannot take the gradient descent approach
Solution: frame it as a constrained optimization problem
Introduce new variables $a_1, \dots, a_N$ with $a_i \ge \lvert \theta^\top \mathbf{x}_i - y_i \rvert$
$$\min_{\theta, \mathbf{a}} \sum_{i=1}^{N} a_i, \quad \text{subject to} \quad -a_i \le \theta^\top \mathbf{x}_i - y_i \le a_i$$
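The constrained formulation above is a linear program, so an off-the-shelf LP solver can be used. A sketch with SciPy, stacking the decision variables as [theta, a]; the variable layout and the reuse of the film data are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    """min sum(a_i)  subject to  -a_i <= theta^T x_i - y_i <= a_i."""
    N, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(N)])   # objective: sum of the a_i
    I = np.eye(N)
    A_ub = np.block([[X, -I],                       #  X theta - a <= y
                     [-X, -I]])                     # -X theta - a <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * N   # theta free, a_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]                                # the fitted theta

X = np.array([[1.0, 98.28], [1.0, 40.22], [1.0, 62.07]])   # bias column + cost
y = np.array([199.69, 93.69, 100.33])
print(fit_absolute_loss(X, y))
```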


Linear Regression with Absolute Loss Function - Example
[Plots comparing the LMS (squared loss) output and the LP (absolute loss) output]


Some Additional Notations


Underlying response function (target concept): $f(\mathbf{x})$
Actual observed response: $y = f(\mathbf{x}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $\mathbb{E}[y \mid \mathbf{x}] = f(\mathbf{x})$
Predicted response based on the model learned from dataset $D$: $h(\mathbf{x}; D)$
Expected response averaged over all datasets: $\bar{h}(\mathbf{x}) = \mathbb{E}_D\big[h(\mathbf{x}; D)\big]$
Expected squared error on a new test instance $\mathbf{x}$: $E_{\mathrm{err}} = \mathbb{E}_D\big[\big(h(\mathbf{x}; D) - y\big)^2\big]$


Bias-Variance Analysis (1)


Bias-Variance Analysis (2)


Bias-Variance Analysis (3)


Root Mean Square Error


Bias-Variance Analysis (4)


9th degree polynomial fit with more sample data


Bias-Variance Analysis (5)


Expected squared loss:
$$\mathbb{E}[L] = \int\!\!\int \big(h(\mathbf{x}) - y\big)^2\, p(\mathbf{x}, y)\, \mathrm{d}\mathbf{x}\, \mathrm{d}y$$


Bias-Variance Analysis (6)


Expected squared loss:
$$\mathbb{E}[L] = \int\!\!\int \big(h(\mathbf{x}) - y\big)^2\, p(\mathbf{x}, y)\, \mathrm{d}\mathbf{x}\, \mathrm{d}y$$


Bias-Variance Analysis (7)


Relevant part of the loss:
$$\int \big(h(\mathbf{x}) - f(\mathbf{x})\big)^2\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}$$


Bias-Variance Analysis (8)


Relevant part of the loss:
$$\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - f(\mathbf{x})\big)^2\big]$$
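Adding and subtracting $\mathbb{E}_D[h(\mathbf{x}; D)]$ inside the square gives the standard decomposition used in the next slides (a sketch of the derivation; the cross term vanishes because the deviation of $h$ from its mean has zero expectation over datasets):
$$\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - f(\mathbf{x})\big)^2\big] = \underbrace{\big(\mathbb{E}_D[h(\mathbf{x}; D)] - f(\mathbf{x})\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\big[\big(h(\mathbf{x}; D) - \mathbb{E}_D[h(\mathbf{x}; D)]\big)^2\big]}_{\text{variance}}$$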


Bias-Variance Analysis (9)


[Plots of fitted models for Degree = 1 and Degree = 4]


Bias-Variance Analysis (10)


Bias term of the error: $\big(\mathbb{E}_D[h(\mathbf{x}; D)] - f(\mathbf{x})\big)^2$

Measures how well our approximation architecture can fit the data
Weak approximators will have high bias
Example: low-degree polynomials
Strong approximators will have low bias
Example: high-degree polynomials


Bias-Variance Analysis (11)


Variance term of the error:
$$\mathbb{E}_D\Big[\big(h(\mathbf{x}; D) - \mathbb{E}_D[h(\mathbf{x}; D)]\big)^2\Big]$$

No direct dependence on the target value

For a fixed size dataset


Strong approximators tend to have more variance
Small changes in the dataset can result in wide changes in the
predictors

Weak approximators tend to have less variance


Small changes in the dataset result in similar predictors

Variance disappears as the dataset size $N \to \infty$


Bias-Variance Analysis (12)


Measuring Bias and Variance in practice
Bootstrap from the given dataset

Start with a complex approximator, and reduce the complexity through regularization
Setting more coefficients/parameters to 0
Doing feature selection
Reduces variance, but can increase bias
Hopefully just sufficient to model the given data
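A sketch of the bootstrap measurement mentioned above: resample the training set, refit a polynomial each time, and average. The true function, noise level, and polynomial degrees are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                  # assumed target concept
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)
x_test = np.linspace(0, 1, 100)

def bootstrap_bias_variance(degree, n_boot=200):
    preds = np.empty((n_boot, len(x_test)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x_train), len(x_train))   # bootstrap resample
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[b] = np.polyval(coeffs, x_test)
    h_bar = preds.mean(axis=0)                       # expected response h_bar(x)
    bias2 = np.mean((h_bar - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for degree in (1, 4, 9):
    print(degree, bootstrap_bias_variance(degree))
```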


Regularization
Central Idea: penalize over-complicated solutions
Linear regression minimizes
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2$$
Regularized regression minimizes
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \cdot \mathrm{penalty}(\theta)$$


Modified Solution
Solution for ordinary linear regression
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y})$$
$$\theta = (X^\top X)^{-1} X^\top \mathbf{y}$$
Now for the regularized version which uses the $\ell_2$ norm - Ridge Regression
$$\min_\theta J(\theta) = \min_\theta \frac{1}{2}(X\theta - \mathbf{y})^\top (X\theta - \mathbf{y}) + \frac{\lambda}{2}\,\lVert\theta\rVert_2^2$$
$$\theta = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$
Exercise: derive the closed-form solution for ridge regression with the $\ell_2$ regularizer
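A sketch of the ridge closed form in NumPy (lam is an assumed regularization strength; in practice the bias coefficient is often left out of the penalty, which is ignored here for brevity):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """theta = (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```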

How to choose $\lambda$?
Tradeoff between complexity vs. goodness of the fit
Solution 1: If we have lots of data
Generate multiple models
Use lots of test data to discard the bad models

Solution 2: With limited data


Use k-fold cross validation
Will discuss later


General Form of the Regularizer Term
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \sum_{j=1}^{d} \lvert\theta_j\rvert^{q}$$
Quadratic / $\ell_2$ regularizer: $q = 2$
[Figure 3.3 from Bishop, PRML: contours of the regularization term for q = 0.5, q = 1, q = 2, and q = 4]

Special scenario q = 1 - LASSO
Least Absolute Shrinkage and Selection Operator
Error function:
$$\sum_{i=1}^{N}\big(h(\mathbf{x}_i) - y_i\big)^2 + \lambda \sum_{j=1}^{d} \lvert\theta_j\rvert$$
For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution
[Figure 3.4 from Bishop, PRML: contours of the unregularized error function along with the constraint region for the quadratic regularizer q = 2 (left) and the lasso regularizer q = 1 (right); the optimum value of the parameter vector w is marked. The lasso gives a sparse solution in which w1 = 0.]

LASSO
Quadratic programming can be used to solve the optimization problem
Least Angle Regression solution - refer to ESL
http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB package for LASSO
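Besides the MATLAB package above, scikit-learn ships a coordinate-descent LASSO solver. A sketch on the film data (the alpha value is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[75.72, 7.57, 32, 52],
              [18.74, 1.87, 16, 68],
              [50.96, 5.09, 27, 35]], dtype=float)
y = np.array([157.39, 81.93, 131.95])

model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)        # for large enough alpha, many coefficients are exactly 0
print(model.intercept_)
```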


Linear Regression with Non-Linear Basis Functions
Linear combination of fixed non-linear functions of the input variables:
$$h(\mathbf{x}) = \theta_0 + \sum_{j=1}^{d} \theta_j\, \phi_j(\mathbf{x})$$

[Figure 3.1 from Bishop, PRML: examples of basis functions - polynomials (left), Gaussians (centre), and sigmoids (right)]

Linear Regression with Basis Functions
Solution
$$\Phi = \begin{pmatrix} 1 & \phi_1(\mathbf{x}_1) & \cdots & \phi_d(\mathbf{x}_1) \\ \vdots & & & \vdots \\ 1 & \phi_1(\mathbf{x}_N) & \cdots & \phi_d(\mathbf{x}_N) \end{pmatrix}, \qquad \theta = \big(\Phi^\top \Phi\big)^{-1} \Phi^\top \mathbf{y}$$
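A sketch of the closed-form fit with Gaussian basis functions (the centres, width s, and synthetic data are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(x, centres, s=0.2):
    """Phi with a leading bias column: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 25))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 25)

Phi = gaussian_basis(x, np.linspace(0, 1, 9))
theta = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y     # theta = (Phi^T Phi)^{-1} Phi^T y
print(theta)
```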


Linear Regression with Multiple Outputs
Multiple outputs: each target is a vector $\mathbf{y}_i \in \mathbb{R}^{K}$
Stack the targets into a matrix $Y$ (one row per example) and keep the same design matrix $X$:
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{pmatrix}$$
$$\Theta = \big(X^\top X\big)^{-1} X^\top Y, \qquad \Theta \in \mathbb{R}^{(d+1)\times K}$$


Summary
Linear Regression (aka curve fitting)
Gradient Descent Approach for finding the solution
Analytical solution
Loss Functions
Probabilistic view of Linear Regression
Bias-Variance analysis
Regularization
Ridge Regression

Regression with basis functions


Locally weighted regression (refer ML - 8.3)
