Introduction Linear Regression

17.
Introduction to Regression
Parameter Estimation
Pitfalls
17.2
Regression
Often in engineering we are interested in the
relationship between two variables
– If we want to understand what factors influence a
variable of interest
– In the absence of actual measurements we can use
other information to infer what is the likely value.
– Compare observed data to simulated or modelled
data
Often, due to randomness and other unknown
factors, the relationship may not be unique
The relationship is presented in the form of a scattergram
17.3
Scattergrams
[Figures: example scattergrams, slides 17.3 to 17.6]
17.7
Regression Analysis
If the relationship between these variables is not
unique, it requires a probabilistic description
Use the technique of Regression Analysis to
determine the mean and variance of one
variable as a function of another variable
If the function is simply a linear function, it is
called Linear Regression
More generally, regression may be nonlinear, but
this is beyond the scope of this course.
17.8
Linear Regression
Range of possible values
of Y increases with X
However, knowing the value
of X does not give perfect
information on Y
The range of values can be
described by a probability
distribution
Mean values of Y
increase with increasing
values of X
If this relationship is
linear, we have linear
regression:
E (Y | X = x) = α + β x
17.9
Linear Regression
E (Y | X = x) = α + β x
Regression equation,
represents regression
of Y on X
α (intercept) and β
(slope) are constants
known as regression
coefficients
Estimated from the
data
17.10
Linear Regression
E (Y | X = x) = α + β x
From the scatter, expect the
variance of Y to depend
on X (in general)
First, let's consider the
case where
Var(Y | X = x) is constant
Estimated from the
data
17.11
Linear Regression – Parameter Estimation

E(Y | X = x) = α + β x
The “best” straight line is the one that passes through the data points with the least error
For each data point (x_i, y_i), the error between the observed y_i and the value from the straight line, y_i' = α + β x_i, is |y_i' − y_i|
For several data points, the total error is represented by the cumulative squared error:
$$\Delta^2 = \sum_{i=1}^{n} (y_i' - y_i)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
Obtain the straight line with the least squared error by minimising $\Delta^2$
Method of Least Squares
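The minimisation can be written out as a short worked derivation. This is the standard calculus argument, sketched here using the same symbols as the squared-error expression; the sample means $\bar{x}$ and $\bar{y}$ are the only extra notation.

```latex
\begin{aligned}
\frac{\partial \Delta^2}{\partial \alpha}
  &= -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} y_i = n\alpha + \beta \sum_{i=1}^{n} x_i \\
\frac{\partial \Delta^2}{\partial \beta}
  &= -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} x_i y_i = \alpha \sum_{i=1}^{n} x_i + \beta \sum_{i=1}^{n} x_i^2
\end{aligned}

% Dividing the first equation by n gives the intercept in terms of the slope:
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}

% Substituting into the second equation and solving for the slope:
\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}
                   {\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}
```

The two conditions are usually called the normal equations. They have a unique solution whenever the x values are not all identical.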
17.12
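Linear Regression – Least Squares
Parameter Estimates

Parameter estimates for E(Y | X = x) = α + β x:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$
$\bar{x}, \bar{y}$ = sample means of X and Y
n = sample size
Least squares regression equation:
$$E(Y \mid x) = \hat{\alpha} + \hat{\beta} x$$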
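As a minimal sketch, the two estimates can be computed directly from the sums in the formulas. The function name `least_squares_fit` and the four test points are hypothetical, chosen so that the points lie exactly on y = 1 + 2x and the result is easy to check by hand.

```python
from statistics import fmean


def least_squares_fit(x, y):
    """Return (alpha_hat, beta_hat) for the line E(Y | x) = alpha + beta * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope: (sum of x*y - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
    beta_hat = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # Intercept: y_bar - beta_hat * x_bar
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical points lying exactly on y = 1 + 2x
alpha_hat, beta_hat = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

With noise-free points the fit recovers the intercept and slope used to generate them, which makes this a convenient sanity check before applying the function to real data.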
17.13
Example
Water resource management in agricultural regions
uses management models for predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A study from the 1960s measured crop water use and compared it
to crop yield over several farms over a number of years
17.14
Example – Regression Analysis

Sugar Cane Yield (t/hectare) [x]    Water Use (mm) [y]
150.0                               60.0
120.0                               62.0
129.4                               72.0
140.0                               82.0
149.0                               88.0
120.0                               92.0
137.7                               102.0
153.9                               103.0
158.9                               108.0
125.0                               116.0
200.0                               170.0
[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare)]
• Using a regression model to predict water use based on annual sugar cane yield
$E[\text{Water Use} \mid \text{Yield}] = \hat{\alpha} + \hat{\beta} \cdot \text{Yield}$
Develop a regression model to predict crop water use based on sugar cane yield
17.15
$E[\text{Water Use} \mid \text{Yield}] = \hat{\alpha} + \hat{\beta} \cdot \text{Yield}$, that is, $\hat{y} = \hat{\alpha} + \hat{\beta} x$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$
$$\Delta^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2, \qquad s_{Y|x}^2 = \frac{\Delta^2}{n-2}, \qquad r^2 = 1 - \frac{s_{Y|x}^2}{s_Y^2}$$

          Yield x (t/hectare)   Water use y (mm)   xy          x^2         y(pred)   [y(obs) - y(pred)]^2
          150.0                 60.0               9000.0      22500.0     101.6     1731.9
          120.0                 62.0               7440.0      14400.0     73.1      123.3
          129.4                 72.0               9319.7      16754.8     82.1      101.5
          140.0                 82.0               11480.0     19600.0     92.1      102.3
          149.0                 88.0               13112.0     22201.0     100.7     160.4
          120.0                 92.0               11040.0     14400.0     73.1      357.1
          137.7                 102.0              14044.7     18959.5     89.9      145.9
          153.9                 103.0              15849.1     23677.5     105.3     5.3
          158.9                 108.0              17165.1     25260.8     110.1     4.5
          125.0                 116.0              14500.0     15625.0     77.9      1455.1
          200.0                 170.0              34000.0     40000.0     149.1     435.1
Sum       1583.9                1055.0             156950.7    233378.7              4622.4
Average   144.0                 95.9
Variance 1034.98
Beta(hat) 0.95
Alpha(hat) -40.95
Conditional Var 513.60
R2 0.50
Example 17.1
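The hand calculation in this example can be reproduced with a few lines of code. The sketch below uses the yields and water-use values as listed; the variable names are illustrative, and R² is computed as 1 − SSE/SST, the form that matches the Excel "R Square" value. Small differences from the tabulated figures are expected because the listed yields are rounded to one decimal place.

```python
from statistics import fmean

# Data as listed in the example (yield in t/hectare, water use in mm)
yield_t_ha = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0,
              137.7, 153.9, 158.9, 125.0, 200.0]
water_use_mm = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0,
                102.0, 103.0, 108.0, 116.0, 170.0]

n = len(yield_t_ha)
x_bar, y_bar = fmean(yield_t_ha), fmean(water_use_mm)

# Least squares estimates of slope and intercept
sum_xy = sum(x * y for x, y in zip(yield_t_ha, water_use_mm))
sum_xx = sum(x * x for x in yield_t_ha)
beta_hat = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
alpha_hat = y_bar - beta_hat * x_bar

# Cumulative squared error and conditional variance (n - 2 degrees of freedom)
sse = sum((y - alpha_hat - beta_hat * x) ** 2
          for x, y in zip(yield_t_ha, water_use_mm))
conditional_var = sse / (n - 2)

# Coefficient of determination as 1 - SSE/SST
sst = sum((y - y_bar) ** 2 for y in water_use_mm)
r_squared = 1 - sse / sst

print(f"beta_hat        = {beta_hat:.4f}")
print(f"alpha_hat       = {alpha_hat:.2f}")
print(f"conditional var = {conditional_var:.1f}")
print(f"R^2             = {r_squared:.4f}")
```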
17.16
Example – Regression Analysis

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare) for the example data, with fitted line y = 0.9505x - 40.954, R2 = 0.5087]
• Regression model to predict water use based on annual sugar cane yield
17.17
Example – Regression Analysis in EXCEL

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.713248241
R Square             0.508723053
Adjusted R Square    0.454136725
Standard Error       22.66269606
Observations         11

ANOVA
              df    SS             MS             F
Regression    1     4786.528955    4786.528955    9.319606
Residual      9     4622.380136    513.5977929
Total         10    9408.909091

                                Coefficients    Standard Error    t Stat         P-value
Intercept                       -40.95410879    45.34972022       -0.90307302    0.390017
Sugar Cane Yield (t/hectare)    0.950471558     0.311343895       3.052802937    0.013731

Tools | Data Analysis | Regression Analysis

Adjusted R Square takes into account the number of
parameters; useful for multiple linear regression
Standard Error = conditional standard deviation, s_{Y|X}, the
sample estimate of the error standard deviation, σ_e
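The summary statistics in the Excel output are linked by simple arithmetic, which can be checked from the ANOVA sums of squares. The sketch below copies the numbers from the output above; the variable names are illustrative, and the relationships used are the standard ones for a one-predictor regression.

```python
import math

# Values copied from the Excel output
n = 11
ss_regression = 4786.528955
ss_residual = 4622.380136
ss_total = 9408.909091
slope, slope_standard_error = 0.950471558, 0.311343895

df_residual = n - 2                      # two estimated parameters
ms_residual = ss_residual / df_residual  # conditional variance estimate

standard_error = math.sqrt(ms_residual)  # conditional standard deviation
r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
f_statistic = ss_regression / ms_residual  # regression has 1 degree of freedom
t_slope = slope / slope_standard_error

print(f"Standard Error    = {standard_error:.4f}")
print(f"R Square          = {r_square:.6f}")
print(f"Adjusted R Square = {adjusted_r_square:.6f}")
print(f"F                 = {f_statistic:.4f}")
print(f"t Stat (slope)    = {t_slope:.4f}")
```

With a single predictor, F equals the square of the slope's t statistic, which gives a quick consistency check on the table.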
17.18
Coefficient of determination, R2
R2 - measure of how well the regression model fits the data
% of variance in Y that is explained by X
By knowing X, the variance in Y is reduced by R2 x 100%
[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare) with fitted line y = 0.9505x - 40.954, R2 = 0.5087, and the conditional distributions of Y marked at x = 150 (mean 102) and x = 190 (mean 139)]
$\hat{y} \mid x = 150 \sim N(102, 22^2)$
$\hat{y} \mid x = 190 \sim N(139, 22^2)$
SD(Y) = 32.2 mm
By knowing x = 150, SD(Y | x = 150) = 22 mm
By knowing x = 190, SD(Y | x = 190) = 22 mm
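A short sketch of what these conditional distributions mean in practice: for a given yield, the fitted line gives the mean water use and the conditional standard deviation gives the spread around it. The code assumes normally distributed errors with constant variance, as on this slide, and ignores the uncertainty in the estimated coefficients; the function name and the 95% range are illustrative choices, not part of the original example.

```python
from statistics import NormalDist

# Fitted line and conditional standard deviation from the example
ALPHA_HAT, BETA_HAT = -40.954, 0.9505
S_Y_GIVEN_X = 22.66  # mm, rounded from the Excel "Standard Error"


def conditional_distribution(yield_t_ha):
    """Normal distribution of water use (mm) given a sugar cane yield."""
    mean = ALPHA_HAT + BETA_HAT * yield_t_ha
    return NormalDist(mu=mean, sigma=S_Y_GIVEN_X)


for yield_value in (150.0, 190.0):
    dist = conditional_distribution(yield_value)
    lower, upper = dist.inv_cdf(0.025), dist.inv_cdf(0.975)
    print(f"x = {yield_value:.0f}: mean = {dist.mean:.1f} mm, "
          f"central 95% range = {lower:.0f} to {upper:.0f} mm")
```

The means come out close to the values marked on the plot, and the width of the range is the same at both yields because the conditional variance is assumed constant.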
17.19
Coefficient of determination, R2
R2 - measure of model fit
% of variance in Y that is explained by X
Match the R2 value to the graph
[Figures: three scatter plots labelled (a), (b) and (c)]
(1) (a) R2 → 0, (b) R2 → 1, (c) 0 < R2 < 1
(2) (a) R2 → 1, (b) 0 < R2 < 1, (c) R2 → 0
(3) (a) R2 → 1, (b) R2 → 0, (c) 0 < R2 < 1
17.20
Summary
Regression Analysis
Probabilistic relationship between two variables
Linear Regression
– Relationship is Linear
Develop Least Squares Estimates of Parameters
– Slope and Intercept
– Conditional Variance
Use this relationship to predict values of y given
x
17.21
Pitfalls of Regression
Extrapolating beyond observed data
Influence of Outliers
High R2 does not always imply a good model
Low R2 does not always imply there is no
relationship
Relationship developed from a regression
analysis does not necessarily imply any causal
effect between variables
– Need to ensure there is a physical process
17.22
Extrapolation
Fitting your regression model
$E(Y \mid X = x) = \hat{\alpha} + \hat{\beta} x$
Making predictions of y using x values
outside the range of observed data
[Figure: Regression of Power on Working Days; Power (kWh) against Working Days, with fitted line y = 15.518x - 100.52, R2 = 0.6839]
17.23
Example
Water resource management in agricultural regions
uses management models for predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A 1960s study measured crop water use and compared it to crop
yield over several farms over a number of years
Regression model of Water Use = F(Crop Yield) applied
in the 2000s
17.24
Data from the 1960s

[Figure: scatter plot of Water Use (mm) with trendline labelled y = 0.5352x + 92.661, R2 = 0.5087]
• Using a regression model to predict water use based on annual sugar cane yield
17.25
Use of Model in the 2000s

• Increase in crop yields from 1960 to 2000, up to 300 t/hectare
• Require a prediction of water use with increased crop yields
[Figure: the Water Use (mm) scatter plot and trendline (y = 0.5352x + 92.661, R2 = 0.5087) with the horizontal axis extended to 350; a question mark marks the region beyond the observed data]
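One simple safeguard is to flag any prediction whose x value lies outside the range of the calibration data. The sketch below is illustrative only: the function name is hypothetical, the observed yields are those listed in the worked example, and the coefficients are taken from the fitted line y = 0.9505x - 40.954 in that example.

```python
# Yields (t/hectare) used to fit the model in the worked example
observed_yields = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0,
                   137.7, 153.9, 158.9, 125.0, 200.0]

ALPHA_HAT, BETA_HAT = -40.954, 0.9505


def predict_with_range_check(x, alpha_hat, beta_hat, x_observed):
    """Return (prediction, is_extrapolation) for the line alpha_hat + beta_hat * x."""
    is_extrapolation = not (min(x_observed) <= x <= max(x_observed))
    return alpha_hat + beta_hat * x, is_extrapolation


for yield_value in (150.0, 300.0):
    prediction, flagged = predict_with_range_check(
        yield_value, ALPHA_HAT, BETA_HAT, observed_yields)
    note = "outside observed range, treat with caution" if flagged else "within observed range"
    print(f"yield = {yield_value:.0f} t/hectare -> {prediction:.1f} mm ({note})")
```

The check does not make an extrapolated prediction any more reliable; it only makes the extrapolation visible so that the physical plausibility of the result can be questioned.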
17.26
Choose a suitable trendline to estimate water use with a sugar cane yield of 300 t/hectare
[Figures: four candidate trendlines, panels (a) to (d), each plotting Water Use (mm) against Sugar Cane Yield (t/hectare)]

17.27
Trendline actually used

[Figure: panel (a), Water Use (mm) plotted over the range 0 to 350]
17.28
Influence of Outliers
[Figure: Water Use (mm) scatter plot comparing the original regression line with the regression line fitted with the largest value removed; the gap between the two lines is annotated ~40%]
• Outliers can exert a large influence on the regression equation
• Be wary of extrapolation
• The regression equation may not provide reliable predictions outside the range
of calibration data – need to think about physical processes!
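The influence of a single point can be illustrated by refitting the line without it. The sketch below uses the yields and water-use values listed in the worked example and drops the point with the largest yield; the helper `fit_line` and the comparison at 300 t/hectare are illustrative choices, and the printed numbers are not claimed to reproduce the annotation on the figure.

```python
from statistics import fmean

yield_t_ha = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0,
              137.7, 153.9, 158.9, 125.0, 200.0]
water_use_mm = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0,
                102.0, 103.0, 108.0, 116.0, 170.0]


def fit_line(x, y):
    """Least squares (alpha_hat, beta_hat) for y = alpha + beta * x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    beta_hat = ((sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar)
                / (sum(a * a for a in x) - n * x_bar ** 2))
    return y_bar - beta_hat * x_bar, beta_hat


# Fit with all points
alpha_full, beta_full = fit_line(yield_t_ha, water_use_mm)

# Refit with the largest-yield point removed
i_max = yield_t_ha.index(max(yield_t_ha))
x_trim = yield_t_ha[:i_max] + yield_t_ha[i_max + 1:]
y_trim = water_use_mm[:i_max] + water_use_mm[i_max + 1:]
alpha_trim, beta_trim = fit_line(x_trim, y_trim)

pred_full = alpha_full + beta_full * 300.0
pred_trim = alpha_trim + beta_trim * 300.0
print(f"all points:      slope = {beta_full:.3f}, prediction at 300 = {pred_full:.0f} mm")
print(f"largest removed: slope = {beta_trim:.3f}, prediction at 300 = {pred_trim:.0f} mm")
```

Removing one point out of eleven changes the slope substantially, and the effect on the prediction grows the further the line is extrapolated.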
17.29
High R2 does not imply a good model
Data on the right will still have a reasonable R2 but is not a good model
17.30
Low R2 does not imply there is no relationship between X and Y
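Both pitfalls can be shown with small made-up data sets. The two data sets below are hypothetical and are not the ones plotted on the slides: in the first, y is an exact curve yet a straight line still scores a high R2; in the second, y is completely determined by x yet the linear R2 is zero.

```python
from statistics import fmean


def linear_fit_r_squared(x, y):
    """Return (alpha_hat, beta_hat, r_squared) for a least squares straight line."""
    x_bar, y_bar = fmean(x), fmean(y)
    # Centred sums; algebraically the same as the sum(xy) - n*x_bar*y_bar form
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    sse = sum((b - alpha_hat - beta_hat * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - y_bar) ** 2 for b in y)
    return alpha_hat, beta_hat, 1 - sse / sst


# Hypothetical case 1: y = x^2 on x = 1..10 (curved, but R2 is high)
x1 = [float(v) for v in range(1, 11)]
y1 = [v * v for v in x1]
a1, b1, r2_curved = linear_fit_r_squared(x1, y1)
residuals = [y - (a1 + b1 * x) for x, y in zip(x1, y1)]

# Hypothetical case 2: y = x^2 on x = -3..3 (exact relationship, but R2 is zero)
x2 = [float(v) for v in range(-3, 4)]
y2 = [v * v for v in x2]
_, _, r2_symmetric = linear_fit_r_squared(x2, y2)

print(f"case 1: R2 = {r2_curved:.3f}, residuals at x = 1, 5, 10: "
      f"{residuals[0]:.0f}, {residuals[4]:.0f}, {residuals[9]:.0f}")
print(f"case 2: R2 = {r2_symmetric:.3f}")
```

In the first case the residuals are positive at both ends and negative in the middle, a systematic pattern that R2 alone does not reveal. Plotting the data and the residuals is the usual way to catch both situations.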
17.31
Summary
Meaning of Coefficient of Determination, R2
Pitfalls of Regression
– Dangers of Extrapolating
– Influence of Outliers
– Consider physical processes when assessing model
adequacy
Next: More Regression
– Probabilistic Predictions using Regression
– Multiple Linear Regression
– Nonlinear Regression
– Residual Analysis
– Verify error model assumptions
