
Lecture Notes on

Linear Regression
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

29 January 2017

Contents
1 Problem definition
  1.1 Perfect fit
  1.2 Minimum error objective

2 Closed form solution

3 Extension to nonlinear regression

4 Multiple data points at the same location x

5 Quality assessment
  5.1 Training error
  5.2 Generalization/testing error
  5.3 Parameter reliability
    5.3.1 Sensitivity analysis
    5.3.2 Bootstrapping

Sorry, these lecture notes lack references. This is all textbook knowledge.

1 Problem definition
Linear regression refers to the process of fitting a linear function

    f(x) = \sum_{i=1}^{I} w_i x_i = w^T x        (1)

to some data

    (s^μ, x^μ) ,   μ ∈ {1, ..., M} ,        (2)

where the s^μ are data values at positions x^μ in the space of the possibly high-dimensional input variable x.
© 2009 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from
other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view
a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own
copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights
(here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not have the rights
to publish are grayed out, but the word Figure, Image, or the like in the reference is often linked to a pdf.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.

1.1 Perfect fit
As a first step we can write the objective of a perfect fit as a system of linear equations

    s^μ = f(x^μ) = \sum_{i=1}^{I} w_i x_i^μ ,        (3)

or in vector notation as

    s = X^T w ,        (4)

with

    X := (x^1, ..., x^M) ,        (5)
    s := (s^1, ..., s^M)^T .        (6)

These equations generally have a solution iff M ≤ I, provided the x^μ are all different. We can see this for (4) as
follows: First add (I − M) linearly independent column vectors to X to make it a regular, i.e. invertible,
I×I matrix X̃, and add (I − M) arbitrary values to s to make it an I-dimensional vector s̃. A possible
solution is then w = (X̃^T)^{-1} s̃. Many other solutions are possible for M < I by choosing different values for the
added s-values. For M = I the solution is unique.
If two x^μ are identical but have different s^μ, then there cannot be a solution.
Such a perfect fit is generally not possible, because in many cases M > I. A perfect fit is often not even
desirable, since data is normally noisy and a perfect fit indicates that there is too little data to constrain
the function reasonably well, so one ends up overfitting. Thus, the more common case is that a perfect fit
does not exist and one therefore has to find a different formulation of what a good solution is.
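For illustration, here is a minimal NumPy sketch (with made-up example data, not part of the original notes) of the case M < I, where an exact fit s = X^T w exists but is not unique:

```python
import numpy as np

# Hypothetical example: I = 3 input dimensions, M = 2 data points (M < I).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0]])      # I x M matrix, columns are the x^mu
s = np.array([3.0, 5.0])        # data values s^mu

# One possible exact solution of s = X^T w; with M < I it is not unique,
# lstsq returns the minimum-norm one.
w, *_ = np.linalg.lstsq(X.T, s, rcond=None)

print(X.T @ w)                  # reproduces s exactly (up to rounding)
```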

1.2 Minimum error objective


If a perfect fit does not exist, the objective is to at least minimize the mean squared error

    E_w := \frac{1}{2M} \sum_{μ=1}^{M} (f(x^μ) - s^μ)^2        (7)

between the values of the data s^μ and the function f(x) at the positions x^μ.
With (1) we can write the objective or error function (7) more explicitly either in component notation

    E_w \overset{(7,1)}{=} \frac{1}{2M} \sum_{μ=1}^{M} \left( \sum_{j=1}^{I} w_j x_j^μ - s^μ \right)^2        (8)

or in vector notation

    E_w \overset{(7,1)}{=} \frac{1}{2M} \sum_{μ=1}^{M} (w^T x^μ - s^μ)^2        (9)

        = \frac{1}{2M} \sum_{μ=1}^{M} \left( (w^T x^μ)^2 - 2 (w^T x^μ) s^μ + (s^μ)^2 \right)        (10)

        = \frac{1}{2M} \sum_{μ=1}^{M} \left( (s^μ)^2 - 2 s^μ x^{μT} w + w^T x^μ x^{μT} w \right)        (11)

        = \frac{1}{2} ⟨s^2⟩ - ⟨s x^T⟩ w + \frac{1}{2} w^T ⟨x x^T⟩ w        (with ⟨·⟩ := \frac{1}{M} \sum_{μ=1}^{M} ·)        (12)

        = a - b^T w + \frac{1}{2} w^T C w ,        (13)

with

    a := \frac{1}{2} ⟨s^2⟩ ,        (14)
    b := ⟨s x⟩ ,        (15)
    C := ⟨x x^T⟩ .        (16)

a is half the mean squared data value. It is an irrelevant constant here that does not depend on w. b can
be interpreted as an average over the input values weighted by the data values. C is the second-moment matrix of the input
values and plays an important role in many different contexts. But for us, a, b, and C are simply a scalar,
a vector, and a symmetric and positive semidefinite square matrix. The symmetry of C follows directly
from its definition. The fact that it is positive semidefinite is less trivial and is shown in the context of principal
component analysis, see the respective lecture notes.
Equation (13) is a quadratic form, as one would expect from a quadratic function (7) of a linear function (1).
It may only be surprising that the terms can be expressed so compactly in the x^μ and s^μ.
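As a sanity check, the following NumPy sketch (with random made-up data, not from the notes) computes a, b, and C as in (14)-(16) and verifies that the quadratic form (13) reproduces the directly evaluated error (7):

```python
import numpy as np

rng = np.random.default_rng(0)
I, M = 3, 100
X = rng.normal(size=(I, M))        # columns are the inputs x^mu
s = rng.normal(size=M)             # data values s^mu
w = rng.normal(size=I)             # some arbitrary weight vector

# Quantities from (14)-(16), with <.> the average over the M data points.
a = 0.5 * np.mean(s**2)
b = (X * s).mean(axis=1)           # <s x>
C = (X @ X.T) / M                  # <x x^T>

# Error via the quadratic form (13) and directly via (7); they agree.
E_quadratic = a - b @ w + 0.5 * w @ C @ w
E_direct = np.mean((X.T @ w - s)**2) / 2
print(np.isclose(E_quadratic, E_direct))
```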

2 Closed form solution


The error function (7) can be minimized by standard optimization methods, such as gradient descent.
However, if we set the gradient to zero, we can even find a closed form solution. The first step is to calculate
the gradient of E_w with respect to the weights w_i. We can do this either in component notation

    \frac{\partial E_w}{\partial w_i} \overset{(8)}{=} \frac{1}{M} \sum_{μ=1}^{M} x_i^μ \left( \sum_{j=1}^{I} w_j x_j^μ - s^μ \right)        (17)

        = \sum_{j=1}^{I} \left( \frac{1}{M} \sum_{μ=1}^{M} x_i^μ x_j^μ \right) w_j - \frac{1}{M} \sum_{μ=1}^{M} s^μ x_i^μ        (18)

        = - ⟨s x_i⟩ + \sum_{j=1}^{I} ⟨x_i x_j⟩ w_j        (with ⟨·⟩ := \frac{1}{M} \sum_{μ=1}^{M} ·)        (19)

or, more conveniently, in vector notation

    \frac{\partial E_w}{\partial w} := \left( \frac{\partial E_w}{\partial w_1}, ..., \frac{\partial E_w}{\partial w_I} \right)^T        (20)

        \overset{(13)}{=} -b + \frac{1}{2} (C w + C^T w)        (21)

        = -b + C w        (since C^T = C) .        (22)

I leave it to the reader as an exercise to verify that (19) and (22) are indeed equivalent.
To get the optimal solution w^* we set the gradient to zero, which is generally a necessary and in this case
(since the error function is a quadratic form bounded from below) also a sufficient condition.

    0 \overset{!}{=} \frac{\partial E_w}{\partial w}\Big|_{w^*} \overset{(22)}{=} -b + C w^*        (23)

    ⟺  w^* = C^{-1} b .        (24)

Note that a unique solution w^* only exists if C is invertible, which is usually the case for many data points.
If there are too few data points, C becomes singular and no unique solution exists. However, in these cases
typically an exact solution exists for which the error vanishes, see Section 1.1.
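A minimal NumPy sketch of the closed-form solution (24) (the function name and the synthetic data are my own illustration, not from the notes):

```python
import numpy as np

def linear_regression(X, s):
    """Closed-form weights w* = C^{-1} b from (24).

    X: (I, M) array, columns are the inputs x^mu; s: (M,) data values.
    """
    M = X.shape[1]
    b = (X * s).mean(axis=1)       # <s x>
    C = (X @ X.T) / M              # <x x^T>
    return np.linalg.solve(C, b)   # fails if C is singular (too few data points)

# Usage with synthetic data (an assumed example):
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
w_true = np.array([0.5, -1.0, 2.0])
s = w_true @ X + 0.1 * rng.normal(size=200)
print(linear_regression(X, s))     # close to w_true
```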

3 Extension to nonlinear regression
You might have come across linear regression in one scalar variable z, where you had to fit the function

    g(z) = a + bz        (25)

to some data (s^μ, z^μ) (note that z is a scalar and not a vector).


Strictly speaking, g(z) is not a linear function but an affine function, i.e. a linear function plus an extra
constant term. The constant term would make the whole formalism cumbersome, and we therefore neglect
it in (1). However, the affine function can be transformed into a linear function if we add a component to
the input variable that always assumes the same value 1. Thus, if we define

    x := (1, z)^T ,        (26)
    w_1 := a ,        (27)
    and w_2 := b ,        (28)

we find that

    f(x) \overset{(1)}{=} w_1 x_1 + w_2 x_2 \overset{(26,27,28)}{=} a + bz \overset{(25)}{=} g(z) .        (29)
This means that if we have found the optimal function f(x) through true linear regression on the data
(s^μ, x^μ), we have actually also solved the problem of fitting the affine function g(z) to the data (s^μ, z^μ).
We have seen above that the constant in (25) makes g(z) in some sense a nonlinear function, but that this
can be dealt with by a mapping between the variables x and (1, z). In a similar move one can introduce
mappings such as x := (1, z, z^2, z^3)^T and thereby find an optimal polynomial of degree three that fits the
data; similarly one could map x := (1, z_1, z_2, z_1^2, z_1 z_2, z_2^2)^T to find an optimal polynomial of degree two if the
original problem had two independent variables. Thus, linear regression can be used for nonlinear regression
simply by mapping the nonlinear terms to new variables in which the function then becomes linear. More
on this in the lecture notes on nonlinear expansion.
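A small NumPy sketch of this idea (with a made-up cubic example; the expansion x := (1, z, z^2, z^3)^T is the one mentioned above), reusing the closed-form solution (24):

```python
import numpy as np

def expand_cubic(z):
    """Map scalar inputs z^mu to x^mu := (1, z, z^2, z^3)^T, as in the text."""
    z = np.asarray(z, dtype=float)
    return np.vstack([np.ones_like(z), z, z**2, z**3])   # shape (4, M)

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=300)
s = 1.0 - 2.0 * z + 0.5 * z**3 + 0.05 * rng.normal(size=300)

X = expand_cubic(z)
b = (X * s).mean(axis=1)
C = (X @ X.T) / X.shape[1]
w = np.linalg.solve(C, b)          # coefficients of the cubic polynomial
print(w)                           # roughly (1.0, -2.0, 0.0, 0.5)
```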

4 Multiple data points at the same location x


Multiple data points at the same location x^μ can occur, e.g., if the data are measurements and one simply
repeats the measurement for identical input values multiple times. To formalize this, let us write the data as

    (s^{μν}, x^μ) ,   μ ∈ {1, ..., M} ,  ν ∈ {1, ..., N^μ} ,        (30)

where μ now indicates the M different input values x^μ and ν indicates the N^μ different data points (s^{μν}, x^μ) at this
input value. The total number of data points is \sum_μ N^μ.

The objective function (7) can then be written as

    E_w = \frac{1}{2 \sum_{μ'} N^{μ'}} \sum_{μ=1}^{M} \sum_{ν=1}^{N^μ} (f(x^μ) - s^{μν})^2        (31)

        = \frac{1}{2 \sum_{μ'} N^{μ'}} \sum_{μ=1}^{M} \sum_{ν=1}^{N^μ} \left( (f(x^μ) - \bar{s}^μ) + (\bar{s}^μ - s^{μν}) \right)^2        (32)

        = \frac{1}{2 \sum_{μ'} N^{μ'}} \sum_{μ=1}^{M} \sum_{ν=1}^{N^μ} \left( (f(x^μ) - \bar{s}^μ)^2 + 2 (f(x^μ) - \bar{s}^μ)(\bar{s}^μ - s^{μν}) + (\bar{s}^μ - s^{μν})^2 \right)        (33)

        = \frac{1}{2 \sum_{μ'} N^{μ'}} \sum_{μ=1}^{M} \left( (f(x^μ) - \bar{s}^μ)^2 \sum_{ν=1}^{N^μ} 1 + 2 (f(x^μ) - \bar{s}^μ) \sum_{ν=1}^{N^μ} (\bar{s}^μ - s^{μν}) + \sum_{ν=1}^{N^μ} (\bar{s}^μ - s^{μν})^2 \right)        (34)

        = \frac{1}{2 \sum_{μ'} N^{μ'}} \sum_{μ=1}^{M} \left( (f(x^μ) - \bar{s}^μ)^2 N^μ + 2 (f(x^μ) - \bar{s}^μ) \underbrace{N^μ (\bar{s}^μ - \bar{s}^μ)}_{=0} + \sum_{ν=1}^{N^μ} (\bar{s}^μ - s^{μν})^2 \right)        (35)

        = \frac{1}{2} \sum_{μ=1}^{M} \frac{N^μ}{\sum_{μ'} N^{μ'}} (f(x^μ) - \bar{s}^μ)^2 + \frac{1}{2} \sum_{μ=1}^{M} \frac{N^μ}{\sum_{μ'} N^{μ'}} \frac{1}{N^μ} \sum_{ν=1}^{N^μ} (\bar{s}^μ - s^{μν})^2        (36)

        = \frac{1}{2} \sum_{μ=1}^{M} n^μ (f(x^μ) - \bar{s}^μ)^2 + \frac{1}{2} \sum_{μ=1}^{M} n^μ var^μ(s) ,        (37)

where the first term measures the squared distance of f from \bar{s}^μ at x^μ and the second the variance of s around \bar{s}^μ at x^μ, with

    mean data values                \bar{s}^μ := \frac{1}{N^μ} \sum_{ν=1}^{N^μ} s^{μν} ,
    normalized weighting factors    n^μ := \frac{N^μ}{\sum_{μ'} N^{μ'}} ,        (38)
    and variances                   var^μ(s) := \frac{1}{N^μ} \sum_{ν=1}^{N^μ} (\bar{s}^μ - s^{μν})^2 .        (39)

This is an interesting result. The second term in (37) does not depend on the weights w_i at all. It can serve
as a lower bound on the error, since it is not possible to go below it. As far as the regression is concerned,
only the first term matters. This also means that by taking more measurements at any of the locations x^μ
you cannot add constraints to the regression; you can only provide more accurate constraints.
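The decomposition (37) can be checked numerically; the following NumPy sketch (with made-up repeated measurements, not part of the original notes) confirms that the error (31) equals the weighted fit term plus the weighted variance term:

```python
import numpy as np

rng = np.random.default_rng(3)
I, M = 2, 5
X = rng.normal(size=(I, M))                 # M distinct locations x^mu
N = rng.integers(2, 6, size=M)              # N^mu repetitions per location
w = rng.normal(size=I)

# Repeated, noisy measurements s^{mu nu} at each location.
s = [w @ X[:, m] + 0.3 * rng.normal(size=N[m]) for m in range(M)]

# Left-hand side: error (31) summed over all individual data points.
E = sum(((w @ X[:, m] - s[m])**2).sum() for m in range(M)) / (2 * N.sum())

# Right-hand side: weighted fit term plus weighted variance term, (37).
n = N / N.sum()
s_bar = np.array([s[m].mean() for m in range(M)])
var = np.array([((s[m] - s_bar[m])**2).mean() for m in range(M)])
E_decomposed = 0.5 * (n * (w @ X - s_bar)**2).sum() + 0.5 * (n * var).sum()

print(np.isclose(E, E_decomposed))          # True
```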

5 Quality assessment
How can we tell whether a linear regression fit is good or not? Well, that depends on what we mean by
good, because that can mean different things.

Training error How well does the function fit the given data points?

Generalization/testing error How well will the function fit data measured in the future by the same
process?
Parameter reliability How accurately and how reliably are the parameters estimated?

We discuss these criteria in turn.

5.1 Training error
Training error (D: Trainingsfehler) is a term from learning theory and means the error on the data used for
training the system, in this case the data used for the regression. This error is simply E_w as quantified by (7).
To a first approximation, the smaller E_w, the better. But note that there is a lower limit given by the second
term in (37) and that too low a value may indicate overfitting and poor generalization.

5.2 Generalization/testing error


The generalization error (D: Generalisierungsfehler) is the error that one would get if one had an infinite
amount of new data not used for training, on which one could test the system. This can typically not really
be evaluated for lack of data, but it can be estimated by testing the system on some testing data held back
and not used for training. If this testing error is much larger than the training error, this indicates
overfitting.
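A minimal sketch of this estimate (assuming a simple hold-out split; the split sizes and data are arbitrary choices of mine):

```python
import numpy as np

def mse(w, X, s):
    """Mean squared error (7) of weights w on the data (X, s)."""
    return np.mean((X.T @ w - s)**2) / 2

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 60))
s = rng.normal(size=5) @ X + 0.2 * rng.normal(size=60)

# Hold back a test set that is not used for the regression.
X_train, s_train = X[:, :40], s[:40]
X_test, s_test = X[:, 40:], s[40:]

C = X_train @ X_train.T / X_train.shape[1]
b = (X_train * s_train).mean(axis=1)
w = np.linalg.solve(C, b)

print("training error:", mse(w, X_train, s_train))
print("testing error: ", mse(w, X_test, s_test))   # estimate of the generalization error
```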

5.3 Parameter reliability


If it is not so much the fitted function that is of interest but the parameters of the function, then the
testing error does not help much. For instance, it could be that data are abundant but that a certain
parameter simply does not matter for the function in the domain where the data points lie. Then the value
of that parameter is pretty random even though the function is well constrained as far as the data points are
concerned.
There are different approaches to assess the reliability of the parameter estimation.

5.3.1 Sensitivity analysis

One possibility to assess the reliability of the parameter estimation is to check how the quality of the solution
degrades as one deviates from the optimal weight vector w^*. In particular, it is interesting how different
parameters interact with each other. For instance, it could be that changing either w_1 or w_2 has a big effect
on E_w, but changing w_1 and w_2 simultaneously in the same direction has little effect.
We can get at this question by writing E_w in terms of a deviation Δw := w − w^* from the optimal solution.

    E_w \overset{(13)}{=} a - b^T (w^* + Δw) + \frac{1}{2} (w^* + Δw)^T C (w^* + Δw)        (40)

        = a - b^T w^* - b^T Δw + \frac{1}{2} w^{*T} C w^* + \frac{1}{2} w^{*T} C Δw + \frac{1}{2} Δw^T C w^* + \frac{1}{2} Δw^T C Δw        (41)

        = \underbrace{a - b^T w^* + \frac{1}{2} w^{*T} C w^*}_{=: E_{w^*}} + \underbrace{(-b + C w^*)^T}_{=0} Δw + \frac{1}{2} Δw^T C Δw        (since C = C^T)        (42)

        = E_{w^*} + \frac{1}{2} Δw^T C Δw .        (43)
Not surprisingly, the linear term is gone. This is because the quadratic term is centered at w^*, for
otherwise the minimum would be somewhere else.
What is surprising, however, is that the effect of a deviation from the optimal solution does not depend on
the measured data values s^μ, but only on the positions x^μ at which we do the measurements. Thus the
sensitivity of the error function with respect to the weight vector can be determined before one actually
takes a single measurement.
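This suggests a simple numerical check: the eigenvectors of C give the directions in which a deviation Δw increases the error the most or the least, see (43). A minimal NumPy sketch (with made-up input positions, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 100))
C = X @ X.T / X.shape[1]                    # only the positions x^mu enter

# Eigen-decomposition of C: directions of high/low sensitivity of E_w.
eigvals, eigvecs = np.linalg.eigh(C)

# Error increase (43) for a unit-length deviation along each eigenvector:
for lam, v in zip(eigvals, eigvecs.T):
    delta_E = 0.5 * v @ C @ v               # equals lam / 2
    print(f"eigenvalue {lam:.3f}: error increase {delta_E:.3f}")
```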

5.3.2 Bootstrapping

If one has M data points available, one can calculate one optimal weight vector w^*. What if we had a
different set of M data points? How much would the optimal weight vector vary as we get new sets of M
data points over and over again? Unfortunately, we only have the M data points given and we cannot answer
this question, even though it would be extremely helpful to know something like the standard deviation of
the optimal weight vector.
To get at such information at least approximately one can use the technique of bootstrapping. In its simplest
form, case resampling, one repeatedly generates a new set of M data points by resampling from the given
data points with replacement and calculates the optimal weight vector from that. By doing this 10,000
times or so one gets a distribution of optimal weight vectors that reflects the uncertainty we have about w^*.
Note that since the weight vector is multidimensional, it is not appropriate to characterize the uncertainty
by standard deviations of the individual components of w^*. Instead one should use (the eigenvectors and
eigenvalues of) the second-moment matrix of the deviations.
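A minimal NumPy sketch of case resampling (the data, sample sizes, and function names are made up for illustration; the fit is the closed-form solution (24)):

```python
import numpy as np

def fit(X, s):
    """Closed-form regression weights (24) for data X (I x M) and s (M,)."""
    C = X @ X.T / X.shape[1]
    b = (X * s).mean(axis=1)
    return np.linalg.solve(C, b)

rng = np.random.default_rng(6)
I, M = 3, 50
X = rng.normal(size=(I, M))
s = np.array([1.0, -0.5, 2.0]) @ X + 0.3 * rng.normal(size=M)

# Case resampling: draw M data points with replacement, refit, repeat.
n_boot = 10_000
w_boot = np.empty((n_boot, I))
for k in range(n_boot):
    idx = rng.integers(0, M, size=M)
    w_boot[k] = fit(X[:, idx], s[idx])

# Characterize the spread by the second-moment matrix of the deviations
# from the estimate on the full data set (see the note above).
dev = w_boot - fit(X, s)
print(dev.T @ dev / n_boot)
```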
