
17.

Introduction to Regression

Parameter Estimation
Pitfalls
17.2

Regression
Often in engineering we are interested in the
relationship between two variables
– To understand which factors influence a
variable of interest
– To infer the likely value of a variable from other
information when actual measurements are absent
– To compare observed data with simulated or modelled
data
Often, because of randomness and other unknown
factors, the relationship is not unique
The relationship is presented in the form of a scattergram
17.3

Scattergrams
17.7

Regression Analysis
If the relationship between these variables is not
unique, it requires a probabilistic description
Use the technique of Regression Analysis to
determine the mean and variance of one
variable as a function of another variable
If the function is linear, it is
called Linear Regression
More generally, regression may be nonlinear, but
this is beyond the scope of this course.
17.8

Linear Regression
Range of possible values
of Y increases with X
However, knowing the value
of X does not give perfect
information about Y
Range of values can be
described by a probability
distribution
Mean value of Y
increases with increasing
values of x
If this relationship is
linear, we have linear
regression:
E (Y | X = x) = α + β x
17.9

Linear Regression
E (Y | X = x) = α + β x

Regression equation,
represents regression
of Y on X
α (intercept) and β
(slope) are constants
known as regression
coefficients
Estimated from the
data
17.10

Linear Regression
E (Y | X = x) = α + β x

From the scatter, expect the
variance of Y to depend
on X (in general)
First, let's consider the
case where
Var(Y|X=x) is constant
Estimated from the
data
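The model on this slide can be sketched as a small simulation: Y is drawn from a distribution whose mean is α + β x and whose variance does not depend on x. The parameter values, the choice of a normal distribution, and the name `simulate_y` are assumptions made for illustration; they are not taken from the lecture.

```python
import random
import statistics

def simulate_y(x, alpha, beta, sigma, rng):
    """Draw one Y from a normal distribution with mean alpha + beta*x and constant SD sigma."""
    return rng.gauss(alpha + beta * x, sigma)

rng = random.Random(0)               # fixed seed so the run is repeatable
alpha, beta, sigma = 2.0, 0.5, 1.0   # hypothetical parameter values

for x in (10.0, 20.0):
    draws = [simulate_y(x, alpha, beta, sigma, rng) for _ in range(10_000)]
    # The sample mean should sit near alpha + beta*x; the sample SD near sigma for both x
    print(x, round(statistics.mean(draws), 2), round(statistics.stdev(draws), 2))
```

The mean of the draws moves with x while their spread stays about the same, which is the constant-variance case considered here.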
17.11

Linear Regression – Parameter Estimation


E (Y | X = x) = α + β x
The “best” straight line is the one that
passes through the data points with
least error
For each data point (xi, yi), the error
between the observed yi
and the value from the straight line, yi' = α + β xi,
is |yi' − yi|
For several data points, the total error
is represented by the cumulative
squared error

$$\Delta^2 = \sum_{i=1}^{n} (y_i' - y_i)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

Obtain the straight line with least
squared error by minimising Δ²
Method of Least Squares
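As a rough illustration of the cumulative squared error, the sketch below evaluates Δ² for two candidate lines on a small data set. The data points and the function name `squared_error` are hypothetical and chosen only to show that a line close to the points gives a smaller Δ².

```python
def squared_error(x, y, alpha, beta):
    """Cumulative squared error between observed y and the line alpha + beta*x."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data lying close to y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]

good = squared_error(x, y, alpha=1.0, beta=2.0)   # line close to the points
poor = squared_error(x, y, alpha=0.0, beta=3.0)   # line further from the points
print(good, poor)
```

The method of least squares picks the α and β for which this quantity is smallest.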
17.12

Linear Regression – Least Squares Parameter Estimates
Parameter estimates for E (Y | X = x) = α + β x:

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$

x̄, ȳ = sample means of X and Y
n = sample size

Least Squares Regression Equation:

E (Y | x) = α̂ + β̂ x
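The estimates for the slope and intercept can be written as a short function. This is a minimal sketch; the name `least_squares_fit` and the test points are illustrative assumptions.

```python
def least_squares_fit(x, y):
    """Return (alpha_hat, beta_hat) for the regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope: (sum of x*y - n*x_bar*y_bar) / (sum of x^2 - n*x_bar^2)
    beta_hat = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical points lying exactly on y = 3 + 2x, so the fit should recover 3 and 2
alpha_hat, beta_hat = least_squares_fit([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])
print(alpha_hat, beta_hat)
```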
17.13

Example
Water resource management in agricultural regions
uses management models for predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A study from the 1960s measured crop water use and compared it
to crop yield over several farms over a number of years
17.14

Example – Regression Analysis


Sugar Cane Yield   Water Use
(t/hectare) [x]    (mm) [y]
150.0              60.0
120.0              62.0
129.4              72.0
140.0              82.0
149.0              88.0
120.0              92.0
137.7              102.0
153.9              103.0
158.9              108.0
125.0              116.0
200.0              170.0

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare)]

• Using a regression model to predict water use
based on annual sugar cane yield
E[Water Use | Yield] = α̂ + β̂ * Yield
Develop a regression model to predict crop water use based on sugar cane yield
17.15

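E[Water Use | Yield] = α̂ + β̂ * Yield

$$\hat{y} = \hat{\alpha} + \hat{\beta}\,x$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$

$$\Delta^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

$$s_{Y|x}^2 = \frac{\Delta^2}{n-2}$$

$$r^2 = 1 - \frac{s_{Y|x}^2}{s_Y^2}$$
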

Sugar Cane Yield   Water Use
(t/hectare) [x]    (mm) [y]     xy          x²          y(pred)   [y(obs) − y(pred)]²
150.0              60.0         9000.0      22500.0     101.6     1731.9
120.0              62.0         7440.0      14400.0     73.1      123.3
129.4              72.0         9319.7      16754.8     82.1      101.5
140.0              82.0         11480.0     19600.0     92.1      102.3
149.0              88.0         13112.0     22201.0     100.7     160.4
120.0              92.0         11040.0     14400.0     73.1      357.1
137.7              102.0        14044.7     18959.5     89.9      145.9
153.9              103.0        15849.1     23677.5     105.3     5.3
158.9              108.0        17165.1     25260.8     110.1     4.5
125.0              116.0        14500.0     15625.0     77.9      1455.1
200.0              170.0        34000.0     40000.0     149.1     435.1
Sum      1583.9    1055.0       156950.7    233378.7              4622.4
Average  144.0     95.9
Variance           1034.98

Beta(hat)          0.95
Alpha(hat)         -40.95
Conditional Var    513.60
R²                 0.50

Example 17.1
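The calculation in Example 17.1 can be reproduced in a few lines. This is a sketch using the yield and water-use values as displayed in the table; the displayed yields appear to be rounded, so the results may differ slightly from the tabulated values in the last digits. R² is computed here as 1 − Δ²/Σ(y − ȳ)², which is the quantity Excel reports as "R Square"; treating that as the intended definition is an assumption.

```python
# Data as displayed in the Example 17.1 table
x = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]  # yield, t/hectare
y = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]        # water use, mm

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum((yi - y_bar) ** 2 for yi in y)

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Cumulative squared error of the fitted line
delta_sq = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
cond_var = delta_sq / (n - 2)        # conditional variance of Y given x
r_squared = 1.0 - delta_sq / s_yy    # fraction of the total sum of squares explained

print(f"beta_hat={beta_hat:.2f} alpha_hat={alpha_hat:.2f}")
print(f"delta_sq={delta_sq:.1f} cond_var={cond_var:.1f} r_squared={r_squared:.2f}")
```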
17.16

Example – Regression Analysis


[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare) for the data of Example 17.1, with fitted line y = 0.9505x − 40.954, R² = 0.5087]

• Regression model to predict water use based
on annual sugar cane yield
17.17

Example – Regression Analysis in EXCEL


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.713248241
R Square            0.508723053
Adjusted R Square   0.454136725
Standard Error      22.66269606
Observations        11

ANOVA
             df   SS            MS            F
Regression   1    4786.528955   4786.528955   9.319606
Residual     9    4622.380136   513.5977929
Total        10   9408.909091

                               Coefficients    Standard Error   t Stat         P-value
Intercept                      -40.95410879    45.34972022      -0.90307302    0.390017
Sugar Cane Yield (t/hectare)   0.950471558     0.311343895      3.052802937    0.013731

Tools | Data Analysis | Regression Analysis


Adjusted R Square takes into account the number of
parameters; useful for multiple linear regression
Standard Error = conditional standard deviation, sY|X,
the sample estimate of the error standard deviation, σe
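The regression statistics in the Excel output can be recovered from the ANOVA sums of squares. The sketch below uses the standard textbook relations; the variable names are illustrative, and `p` (the number of predictors) is 1 for simple linear regression.

```python
import math

# Values copied from the ANOVA table
ss_regression = 4786.528955
ss_residual = 4622.380136
n = 11   # observations
p = 1    # number of predictors (assumed: one, the sugar cane yield)

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total
# Adjusted R Square penalises the number of fitted parameters
adjusted_r_square = 1.0 - (1.0 - r_square) * (n - 1) / (n - p - 1)
ms_residual = ss_residual / (n - p - 1)      # conditional variance
standard_error = math.sqrt(ms_residual)      # conditional standard deviation
f_stat = (ss_regression / p) / ms_residual

print(r_square, adjusted_r_square, standard_error, f_stat)
```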
17.18

Coefficient of determination, R²
R² is a measure of how well the regression model fits the data
% of variance in Y that is explained by X
By knowing X, variance in Y is reduced by R² × 100%
[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare) with fitted line y = 0.9505x − 40.954, R² = 0.5087; conditional distributions marked at x = 150 (mean 102) and x = 190 (mean 139)]
ŷ | x = 150 ~ N(102, 22²)
ŷ | x = 190 ~ N(139, 22²)

SD(Y) = 32.2 mm
By knowing x = 150, SD(Y | x = 150) = 22 mm
By knowing x = 190, SD(Y | x = 190) = 22 mm
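The two conditional distributions on this slide can be sketched with the standard library's `NormalDist`. The coefficients and the conditional standard deviation are taken from the fitted example; the function name `water_use_distribution` is an illustrative assumption.

```python
from statistics import NormalDist

ALPHA_HAT, BETA_HAT = -40.954, 0.9505   # fitted line from the example
SD_Y_GIVEN_X = 22.66                    # conditional standard deviation ("Standard Error" in Excel)

def water_use_distribution(yield_t_per_ha):
    """Normal distribution of water use (mm) given yield, with constant SD."""
    mean = ALPHA_HAT + BETA_HAT * yield_t_per_ha
    return NormalDist(mu=mean, sigma=SD_Y_GIVEN_X)

for x in (150, 190):
    d = water_use_distribution(x)
    # The mean changes with x; the standard deviation does not
    print(x, round(d.mean, 1), d.stdev)
```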
17.19

Coefficient of determination, R²
R² is a measure of model fit
% of variance in Y that is explained by X
Match the R² value to the graph
[Figure: three scatter plots, panels (a), (b) and (c)]

(1) (a) R² → 0, (b) R² → 1, (c) 0 < R² < 1
(2) (a) R² → 1, (b) 0 < R² < 1, (c) R² → 0
(3) (a) R² → 1, (b) R² → 0, (c) 0 < R² < 1
17.20

Summary
Regression Analysis
Probabilistic relationship between two variables
Linear Regression
– Relationship is linear
Develop Least Squares Estimates of Parameters
– Slope and Intercept
– Conditional Variance
Use this relationship to predict values of y given x
17.21

Pitfalls of Regression
Extrapolating beyond observed data
Influence of Outliers
High R² does not always imply a good model
Low R² does not always imply there is no
relationship
A relationship developed from a regression
analysis does not necessarily imply any causal
effect between variables
– Need to ensure there is a physical process
17.22

Extrapolation
Fitting your regression model
E (Y | X = x) = α̂ + β̂ x
Making predictions of y using x values
outside the range of observed data
[Figure: Regression of Power on Working Days; Power (kWh) against Working Days, with fitted line y = 15.518x − 100.52, R² = 0.6839]
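One way to guard against accidental extrapolation is to flag any prediction whose x lies outside the range of the calibration data. The sketch below uses the fitted power equation from the figure; the list of observed working days and the name `predict_within_range` are hypothetical, since the underlying data are not given.

```python
def predict_within_range(x_new, x_observed, alpha_hat, beta_hat):
    """Return (prediction, is_extrapolation) for the line alpha_hat + beta_hat*x."""
    lo, hi = min(x_observed), max(x_observed)
    prediction = alpha_hat + beta_hat * x_new
    return prediction, not (lo <= x_new <= hi)

# Hypothetical calibration range; the actual observed values are not listed on the slide
observed_days = [21, 22, 23, 24, 25, 26]

inside, inside_flag = predict_within_range(23, observed_days, -100.52, 15.518)
outside, outside_flag = predict_within_range(5, observed_days, -100.52, 15.518)
print(inside, inside_flag)
print(outside, outside_flag)   # flagged, and the predicted power is negative
```

At 5 working days the line predicts negative power use, which is physically meaningless even though the fit within the observed range may be reasonable.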
17.23

Example
Water resource management in agricultural regions
uses management models for predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A 1960s study measured crop water use and compared it to crop
yield over several farms over a number of years
Regression model of Water Use = F(Crop Yield) applied
in the 2000s
17.24

Data from the 1960s

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare), yield axis 0 to 200, with fitted line y = 0.5352x + 92.661, R² = 0.5087]

• Using a regression model to predict water use
based on annual sugar cane yield
17.25

Use of Model in the 2000s

• Increase in crop yields from 1960 to 2000, up to 300 t/hectare
• Require a prediction of water use with increased crop yields

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare), yield axis extended to 350, with fitted line y = 0.5352x + 92.661, R² = 0.5087; a "?" marks the unknown water use at high yields]
17.26

Choose a suitable trendline to estimate water use with a
sugar cane yield of 300 t/hectare

[Figure: four candidate trendlines, panels (a) to (d); each plots Water Use (mm), 0 to 300, against Sugar Cane Yield (t/hectare), 0 to 350]

17.27

Trendline actually used


(a)

[Figure: panel (a); Water Use (mm), 0 to 300, against Sugar Cane Yield (t/hectare), 0 to 350]
17.28

Influence of Outliers
[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare), with a second regression line fitted with the largest value removed; annotation: ~40%]

• Outliers can exert a large influence on the regression equation
• Be wary of extrapolation
• The regression equation may not provide reliable predictions outside the range
of calibration data; need to think about physical processes!
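The effect of a single influential point can be checked by refitting without it. The sketch below uses the Example 17.1 data (as displayed in that table) as an assumed stand-in for the points plotted on this slide, and drops the observation with the largest yield; the helper name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    return y_bar - beta_hat * x_bar, beta_hat

x = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
y = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]

alpha_all, beta_all = fit_line(x, y)

# Remove the point with the largest yield (200.0 t/hectare, 170.0 mm) and refit
i_max = x.index(max(x))
x_trim = x[:i_max] + x[i_max + 1:]
y_trim = y[:i_max] + y[i_max + 1:]
alpha_trim, beta_trim = fit_line(x_trim, y_trim)

# Compare the extrapolated predictions at 300 t/hectare
pred_all = alpha_all + beta_all * 300.0
pred_trim = alpha_trim + beta_trim * 300.0
print(round(beta_all, 2), round(beta_trim, 2))
print(round(pred_all, 1), round(pred_trim, 1))
```

With these data the slope and the extrapolated prediction both drop sharply once the largest point is removed, so one observation is carrying much of the fitted relationship.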
17.29

High R² does not imply a good model

Data on the right will still give a reasonable R², but the model is not a good one
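A hypothetical case makes the point concrete: a straight line fitted to clearly curved data can still score a high R². The data (y = x²) and the helper name `fit_line` are invented for illustration and are not the data shown on the slide.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    return y_bar - beta_hat * x_bar, beta_hat

x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]            # hypothetical curved data, y = x^2

alpha_hat, beta_hat = fit_line(x, y)
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
y_bar = sum(y) / len(y)
r_squared = 1.0 - sum(r ** 2 for r in residuals) / sum((yi - y_bar) ** 2 for yi in y)

print(round(r_squared, 2))
# Residuals are positive at both ends and negative in the middle: a systematic pattern
print([round(r, 1) for r in residuals])
```

The systematic pattern in the residuals shows that the straight line is the wrong model, even though R² is high.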
17.30

Low R² does not imply there is no relationship between X and Y
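A hypothetical case of the opposite kind: Y is determined exactly by X, yet the linear R² is zero because the relationship is symmetric rather than linear. The data and the helper name `r_squared_of_line` are invented for illustration.

```python
def r_squared_of_line(x, y):
    """R^2 of the least-squares straight line fitted to (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy ** 2 / (s_xx * s_yy)

x = [float(i) for i in range(11)]
y = [(xi - 5.0) ** 2 for xi in x]    # hypothetical data: y depends exactly on x, but not linearly

r_squared = r_squared_of_line(x, y)
print(r_squared)
```

A scattergram of such data would show the relationship immediately, which is why plotting the data matters alongside R².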
17.31

Summary
Meaning of Coefficient of Determination, R²
Pitfalls of Regression
– Dangers of Extrapolating
– Influence of Outliers
– Consider physical processes when assessing model
adequacy
Next: More Regression
– Probabilistic Predictions using Regression
– Multiple Linear Regression
– Nonlinear Regression
– Residual Analysis
– Verify error model assumptions
