Introduction to Regression
Parameter Estimation
Pitfalls
17.2
Regression
Often in engineering we are interested in the
relationship between two variables
– To understand which factors influence a
variable of interest
– To infer the likely value of a variable from other
information when actual measurements are absent
– To compare observed data with simulated or modelled
data
Often, due to randomness and other unknown
factors, the relationship may not be unique
The relationship is presented in the form of a scattergram
17.3
Scattergrams
[Figures: example scattergrams on slides 17.3 to 17.6; images not recovered]
17.7
Regression Analysis
If the relationship between these variables is not
unique, it requires a probabilistic description
Use the technique of Regression Analysis to
determine the mean and variance of one
variable as a function of another variable
If the function is linear, it is
called Linear Regression
More generally, regression may be nonlinear, but
this is beyond the scope of this course.
17.8
Linear Regression
Range of possible values
of Y increases with X
However, knowing the value
of X does not give perfect
information about Y
The range of values could be
described by a probability
distribution
Mean values of Y
increase with increasing
values of X
If this relationship is
linear, we have linear
regression:
E (Y | X = x) = α + β x
17.9
Linear Regression
E (Y | X = x) = α + β x
Regression equation,
represents regression
of Y on X
α (intercept) and β
(slope) are constants
known as regression
coefficients
Estimated from the
data
17.10
Linear Regression
E (Y | X = x) = α + β x
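Least Squares: choose α and β to minimise the sum of squared errors
\[ \Delta^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \]
Least squares estimates:
\[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \]
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \]
Regression Equation:
\[ E(Y \mid x) = \hat{\alpha} + \hat{\beta} x \]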
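The least squares estimates above can be computed directly from paired observations. The following is a minimal sketch, not the lecture's own code: the function name `fit_line` and the four data points are hypothetical and chosen so that the answer is easy to check by hand (the points lie exactly on y = 1 + 2x).

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat), the least squares intercept and slope."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data (not from the lecture) lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)
```

Because the hypothetical points have no scatter, the estimates recover the intercept 1 and slope 2; with real data the fitted line passes through the point of means rather than through every observation.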
17.13
Example
Water resource management in agricultural regions
uses management models to make predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A study from the 1960s measured crop water use and compared it
to crop yield over several farms over a number of years
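17.14
[Data table not recovered; one surviving row reads 120.0 62.0]
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \]
\[ \Delta^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \]
\[ s_{Y|x}^2 = \frac{\Delta^2}{n - 2} \]
\[ r^2 = 1 - \frac{s_{Y|x}^2}{s_Y^2} \]
Beta(hat)          0.95
Alpha(hat)       -40.95
Conditional Var  513.60
R2                 0.50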
Example 17.1
17.16
Regression Statistics
Multiple R          0.713248241
R Square            0.508723053
Adjusted R Square   0.454136725
Standard Error      22.66269606
Observations        11
ANOVA
             df   SS            MS            F
Regression    1   4786.528955   4786.528955   9.319606
Residual      9   4622.380136   513.5977929
Total        10   9408.909091
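The entries in the regression output are linked by simple arithmetic. As a rough check, the sketch below recomputes the summary statistics from the sums of squares and the number of observations quoted above. The variable names are illustrative only and are not part of any spreadsheet API.

```python
import math

# Values copied from the regression output above.
n = 11
ss_regression = 4786.528955
ss_residual = 4622.380136
ss_total = 9408.909091

df_residual = n - 2   # two parameters (intercept and slope) are estimated
df_total = n - 1

ms_residual = ss_residual / df_residual            # conditional variance of Y given x
standard_error = math.sqrt(ms_residual)            # conditional standard deviation
r_square = ss_regression / ss_total                # share of variation explained by x
adjusted_r_square = 1 - ms_residual / (ss_total / df_total)
f_statistic = (ss_regression / 1) / ms_residual    # regression has 1 degree of freedom
multiple_r = math.sqrt(r_square)

print(f"Conditional variance: {ms_residual:.2f}")
print(f"Standard error:       {standard_error:.2f}")
print(f"R Square:             {r_square:.4f}")
print(f"Adjusted R Square:    {adjusted_r_square:.4f}")
print(f"F:                    {f_statistic:.4f}")
print(f"Multiple R:           {multiple_r:.4f}")
```

Note that when the sample variance of Y is taken as SS total divided by n - 1, the ratio form 1 - (conditional variance)/(variance of Y) reproduces the Adjusted R Square entry, while the ratio of sums of squares reproduces the R Square entry.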
Coefficient of determination, R2
R2 - a measure of how well the regression model matches (“fits”) the data
% of variance in Y that is explained by X
By knowing X, the variance in Y is reduced by R2 x 100%
[Figure: scattergram with fitted line y = 0.9505x - 40.954; x-axis: Sugar Cane Yield (t/hectare), 100 to 220; y-axis ticks 40 to 200. Conditional distributions marked on the line: ŷ | x = 150 ~ N (102, 22) and ŷ | x = 190 ~ N (139, 22)]
SD(Y)=32.2 mm
By knowing x=150, SD(Y|x=150) = 22 mm
By knowing x=190, SD(Y|x=190) = 22 mm
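The conditional means quoted for x = 150 and x = 190 follow from the fitted line, and the conditional standard deviation of 22 mm then gives a spread around each mean. The sketch below assumes, as the slide's N(mean, 22) notation suggests, a normal distribution with constant standard deviation. The 95% range is an illustrative addition, it ignores uncertainty in the estimated coefficients, and the function name `conditional_distribution` is hypothetical.

```python
from statistics import NormalDist

# Coefficients and conditional SD as quoted on the slide (rounded values).
ALPHA_HAT = -40.954
BETA_HAT = 0.9505
SD_Y_GIVEN_X = 22.0  # mm


def conditional_distribution(x):
    """Normal distribution of Y given X = x, assuming constant conditional SD."""
    mean = ALPHA_HAT + BETA_HAT * x
    return NormalDist(mu=mean, sigma=SD_Y_GIVEN_X)


for x in (150, 190):
    dist = conditional_distribution(x)
    # Central 95% range under the assumed normal model (illustrative only).
    low, high = dist.inv_cdf(0.025), dist.inv_cdf(0.975)
    print(f"x = {x}: mean = {dist.mean:.1f} mm, 95% range = ({low:.1f}, {high:.1f}) mm")
```

The spread is the same at both x values because the model assumes one conditional variance for all x; only the mean moves along the line.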
17.19
Coefficient of determination, R2
R2 - measure of model fit
% of variance in Y that is explained by X
Match the R2 value to the graph
[Figure: scattergram panels to match; only the labels (a) and (c) survive]
17.20
Summary
Regression Analysis
Probabilistic Relationship between two variables
Linear Regression
– Relationship is Linear
Develop Least Squares Estimates of Parameters
– Slope and Intercept
– Conditional Variance
Use this relationship to predict values of y given x
17.21
Pitfalls of Regression
Extrapolating beyond observed data
Influence of Outliers
High R2 does not always imply a good model
Low R2 does not always imply there is no
relationship
A relationship developed from a regression
analysis does not necessarily imply any causal
effect between variables
– Need to ensure there is a physical process linking the variables
17.22
Extrapolation
Fitting your regression model
E (Y | X = x) = αˆ + βˆ x
Making predictions of y, using x values
outside the range of observed data
[Figure: Regression of Power on Working Days. Fitted line y = 15.518x - 100.52, R2 = 0.6839; x-axis: Working Days, ticks 20 to 32; y-axis: Power (kWh), ticks 200 to 400]
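One way to guard against extrapolation is to record the range of x used in the fit and flag any prediction outside it. The sketch below uses the fitted line y = 15.518x - 100.52 quoted in the Power on Working Days figure. The function name `predict_with_range_check` is hypothetical, and the range 20 to 32 is an assumption read from the axis ticks, not the actual data range.

```python
def predict_with_range_check(x, alpha_hat, beta_hat, x_min, x_max):
    """Return (prediction, is_extrapolation) for a fitted straight line."""
    if x_min > x_max:
        raise ValueError("x_min must not exceed x_max")
    prediction = alpha_hat + beta_hat * x
    is_extrapolation = not (x_min <= x <= x_max)
    return prediction, is_extrapolation


# Coefficients from the figure; the observed range is ASSUMED from the axis ticks.
ALPHA_HAT = -100.52
BETA_HAT = 15.518
X_MIN, X_MAX = 20, 32

for days in (25, 5):
    power, flagged = predict_with_range_check(days, ALPHA_HAT, BETA_HAT, X_MIN, X_MAX)
    note = "EXTRAPOLATION, treat with caution" if flagged else "within assumed range"
    print(f"{days} working days: {power:.1f} kWh ({note})")
```

At 5 working days the line predicts negative power use, which is physically impossible and shows why a straight line fitted over a narrow range should not be trusted far outside it.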
17.23
Example
Water resource management in agricultural regions
uses management models to make predictions
– Determine water allocations, cropping regimes, etc.
An important input is the crop water use
Management models run over periods of many years
Need crop water use every year
Crop water use is difficult to measure
– Not directly available annually
– Can be related to crop yield, which is available annually
– Higher crop yield = crops use more water
– A 1960s study measured crop water use and compared it to crop
yield over several farms over a number of years
Regression model of Water Use = F(Crop Yield) applied
in the 2000s
17.24
[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), yield axis 0 to 200, with fitted line y = 0.5352x + 92.661, R2 = 0.5087]
[Figure: the same fitted line on a yield axis extended to 350, with a "?" near 250 mm of water use]
17.26
[Figures: further plots of Water Use (mm) against Sugar Cane Yield (t/hectare) on the 0 to 350 yield axis; no annotations recovered]
17.28
Influence of Outliers
[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), yield axis 0 to 350. Annotations: "~40%" and "Regression line with largest value removed"]
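The effect of a single outlier can be checked by refitting the line with that point left out and comparing slopes. The sketch below uses entirely hypothetical data (not the sugar cane data from the lecture): six points lying on y = 100 + 0.5x plus one distant point. It removes the point with the largest x value, which is an assumption about what "largest value" refers to in the figure.

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat


# HYPOTHETICAL data: six points on y = 100 + 0.5x and one distant outlier.
xs = [100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 330.0]
ys = [150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 150.0]

# Fit with every point, then again without the point that has the largest x.
_, slope_all = fit_line(xs, ys)
drop = xs.index(max(xs))
xs_trimmed = xs[:drop] + xs[drop + 1:]
ys_trimmed = ys[:drop] + ys[drop + 1:]
_, slope_trimmed = fit_line(xs_trimmed, ys_trimmed)

print(f"slope with all points:      {slope_all:.3f}")
print(f"slope with outlier removed: {slope_trimmed:.3f}")
```

In this made-up example one point far from the rest is enough to flatten the fitted line almost completely, so it is worth plotting the data and refitting without suspect points before relying on the coefficients.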
Summary
Meaning of Coefficient of Determination, R2
Pitfalls of Regression
– Dangers of Extrapolating
– Influence of Outliers
– Consider physical processes when assessing model
adequacy
Next: More Regression
– Probabilistic Predictions using Regression
– Multiple Linear Regression
– Nonlinear Regression
– Residual Analysis
– Verify error model assumptions