
Session 1

Applied Regression -- Prof. Juran 2


Outline for Session 1
Course Objectives & Description
Review of Basic Statistical Ideas
Intercept, Slope, Correlation, Causality
Simple Linear Regression
Statistical Model and Concepts
Regression in Excel
Applied Regression -- Prof. Juran 3
Course Themes
Learn useful and practical tools of regression and data analysis
Learn by example and by doing
Learn enough theory to use regression safely
Applied Regression -- Prof. Juran 4
Shape the course experience to meet
your goals
The agenda is flexible
Pick your own project
The professor also enjoys learning
Let's enjoy ourselves; life is too short
Applied Regression -- Prof. Juran 5
Basic Information
www.columbia.edu/~dj114/b8831.htm

Teaching Assistant:
Adam Klappholz (D12)

Applied Regression -- Prof. Juran 6
Basic Requirements
Come to class and participate
Cases about once per week
Project
Applied Regression -- Prof. Juran 7
What is Regression Analysis?
A Procedure for Data Analysis
Regression analysis is a family of
mathematical procedures for fitting
functions to data.
The most basic procedure -- simple linear
regression -- fits a straight line to a set of
data so that the sum of the squared y
deviations is minimal. Regression can be
used on a completely pragmatic basis.
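
As a minimal illustration of this pragmatic use (a sketch in Python, not part of the original slides; it uses the first few rows of the computer repair data that appear later in the session):

import numpy as np

# First five observations of the computer repair example: units serviced vs. minutes
x = np.array([1, 2, 3, 4, 4], dtype=float)
y = np.array([23, 29, 49, 64, 74], dtype=float)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, 1)

# The quantity being minimized: the sum of squared y-deviations from the line
fitted = intercept + slope * x
sse = np.sum((y - fitted) ** 2)
print(f"y-hat = {intercept:.2f} + {slope:.2f} x, SSE = {sse:.2f}")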

Applied Regression -- Prof. Juran 8
What is Regression Analysis?
A Foundation for Statistical Inference
If special statistical conditions hold, the
regression analysis:
Produces statistically best estimates of the
true underlying relationship and its
components
Provides measures of the quality and reliability
of the fitted function
Provides the basis for hypothesis tests and
confidence and prediction intervals
Applied Regression -- Prof. Juran 9
Some Regression Applications
Determining the factors that influence energy consumption in a detergent plant
Measuring the volatility of financial securities
Determining the influence of ambient launch temperature on Space Shuttle O-ring burn-through
Identifying demographic and purchase history factors that predict high consumer response to catalog mailings
Mounting a legal defense against a charge of sex discrimination in pay
Determining the cause of leaking antifreeze bottles on a packing line
Measuring the fairness of CEO compensation
Predicting monthly champagne sales
Applied Regression -- Prof. Juran 10
Course Outline
Basics of regression
Bottom: inferences about effects of
independent variables on the dependent
variable
Middle: Analysis of Variance
Top: summary measures for the model
Applied Regression -- Prof. Juran 11
Course Outline
Advanced Regression Topics
Interval Estimation
Full Model with Arrays
Qualitative Variables
Residual Analysis
Thoughts on Nonlinear Regression
Model-building Ideas
Multicollinearity
Autocorrelation, serial correlation
Applied Regression -- Prof. Juran 12
Course Outline
Related Topics
Chi-square Goodness-of-Fit Tests
Forecasting Methods
Exponential Smoothing
Regression
Two Multivariate Methods
Cluster Analysis
Discriminant Analysis
Binary Logistic Regression


Applied Regression -- Prof. Juran 13
The Theory Underlying
Simple Linear Regression
Regression can always be used to fit a straight line to a set of data. It is a relatively easy computational task (Excel, Minitab, etc.).
If specified conditions hold, statistical theory can be employed to evaluate the quality and reliability of the line, for example in predicting future events.
Applied Regression -- Prof. Juran 14
The Standard Statistical Model
Y: the dependent random variable, the effect or
outcome that we wish to predict or understand.
X: the independent deterministic variable, an
input, cause or determinant that may cause,
influence, explain or predict the values of Y.
Y = β₀ + β₁X + ε

where Y is the dependent random variable, X is the independent deterministic variable, β₀ and β₁ are the parameters of the true regression relationship, and ε is a random noise factor.
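
A small simulation can make the roles of β₀, β₁, and ε concrete. This is a sketch in Python, not part of the slides; the parameter values are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 4.0, 15.5, 5.0       # hypothetical "true" parameters
x = np.arange(1, 11, dtype=float)          # deterministic X values
eps = rng.normal(0.0, sigma, size=x.size)  # random noise factor
y = beta0 + beta1 * x + eps                # dependent random variable Y

print(np.column_stack([x, y]))             # one simulated data set from the model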
Applied Regression -- Prof. Juran 15
Regression Assumptions
The expected value of Y is a linear function of X:

E[Y(X)] = β₀ + β₁X,   E(ε) = 0

The variance of Y does not change with X:

Var[Y(X)] = σ²,   Var(ε) = σ²
Applied Regression -- Prof. Juran 16
Regression Assumptions
Random variations at different X values are uncorrelated:

Cov(εᵢ, εⱼ) = 0,   i ≠ j

Random variations from the regression line are normally distributed:

Y(X) ~ N(β₀ + β₁X, σ²),   ε ~ N(0, σ²)
Applied Regression -- Prof. Juran 17
Thoughts on Linearity
The significance of the word linear in the linear regression model

Y(X) = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ

is not linearity in the X's; it is linearity in the betas (the slope coefficients). Consider the following variants, both of which are linear:

Y(X) = β₀ + β₁X₁ + β₂X₁² + β₃X₂ + β₄X₁X₂

ln Y(X) = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ
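
Because both variants stay linear in the betas, the same least-squares machinery fits them. A minimal sketch in Python (invented data and coefficients; numpy in place of Excel):

import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 + 0.3 * x1**2 - 0.8 * x2 + 0.2 * x1 * x2 + rng.normal(0, 1, n)

# The columns may be nonlinear in the X's; the model is still linear in the betas.
X = np.column_stack([np.ones(n), x1, x1**2, x2, x1 * x2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas)  # estimates of beta0 through beta4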
Applied Regression -- Prof. Juran 18
There are many creative ways to fit non-linear functions
by linear regression. Consider a few popular
linearizations:

Y = α X^β                          linearized as   log Y = log α + β log X
Y = α e^(βX)                       linearized as   ln Y = ln α + βX
Y = α + β ln X                     already linear in ln X
Y = X / (αX − β)                   linearized as   1/Y = α − β(1/X)
Y = e^(α+βX) / (1 + e^(α+βX))      linearized as   ln[Y/(1 − Y)] = α + βX
Time permitting, we will look at some of these
possibilities later in the course. These may present
interesting opportunities for student term projects.
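
For instance, the exponential form Y = αe^(βX) can be fitted by regressing ln Y on X. This is a sketch in Python with invented data and parameter values, not from the slides:

import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 0.3                                           # hypothetical true values
x = np.linspace(1, 10, 40)
y = alpha * np.exp(beta * x) * rng.lognormal(0.0, 0.1, x.size)   # multiplicative noise

# ln Y = ln(alpha) + beta * X is linear in the unknowns
b, a_log = np.polyfit(x, np.log(y), 1)
print("alpha estimate:", np.exp(a_log), "beta estimate:", b)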

Applied Regression -- Prof. Juran 19
Regression Estimators
We are given the data set:

   i:   1     2    . . .    i    . . .    n
   Y:   y₁    y₂   . . .   yᵢ    . . .   yₙ
   X:   x₁    x₂   . . .   xᵢ    . . .   xₙ

We seek good estimators β̂₀ of β₀ and β̂₁ of β₁ that minimize the sum of the squared residuals (errors). The i-th residual is

eᵢ = yᵢ − (β̂₀ + β̂₁xᵢ),   i = 1, 2, ..., n
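
In code, the residuals and the least-squares objective look like this (a sketch in Python; b0 and b1 stand for candidate values of the estimates β̂₀ and β̂₁):

import numpy as np

def sse(b0, b1, x, y):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    e = y - (b0 + b1 * x)
    return float(np.sum(e ** 2))

Least-squares estimation picks the (b0, b1) pair that makes this quantity as small as possible; the Solver worksheet later in the session performs exactly this search.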

Applied Regression -- Prof. Juran 20
Computer Repair Example
Obs   Minutes   Units
  1       23       1
  2       29       2
  3       49       3
  4       64       4
  5       74       4
  6       87       5
  7       96       6
  8       97       6
  9      109       7
 10      119       8
 11      149       9
 12      145       9
 13      154      10
 14      166      10
Applied Regression -- Prof. Juran 21
Statistical Basics
Basic statistical computations and graphical displays are very helpful in
doing and interpreting a regression. We should always compute:
ȳ = (1/n) Σᵢ yᵢ   and   x̄ = (1/n) Σᵢ xᵢ

s_y = √[ Σᵢ (yᵢ − ȳ)² / (n − 1) ]   and   s_x = √[ Σᵢ (xᵢ − x̄)² / (n − 1) ]

r_{X,Y} = Σᵢ (yᵢ − ȳ)(xᵢ − x̄) / √[ Σᵢ (yᵢ − ȳ)² · Σᵢ (xᵢ − x̄)² ]
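
These summary quantities for the computer repair data can be reproduced directly; a sketch in Python (not part of the slides), whose results should match the spreadsheet that follows:

import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

ybar, xbar = minutes.mean(), units.mean()          # about 97.21 and 6
sy, sx = minutes.std(ddof=1), units.std(ddof=1)    # sample standard deviations, ~46.22 and ~2.96
r = np.corrcoef(units, minutes)[0, 1]              # Pearson correlation, about 0.9937
print(ybar, xbar, sy, sx, r)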
Applied Regression -- Prof. Juran 22
Obs   Minutes   Units   Error (min)   Error (units)
  1       23       1       -74.21          -5
  2       29       2       -68.21          -4
  3       49       3       -48.21          -3
  4       64       4       -33.21          -2
  5       74       4       -23.21          -2
  6       87       5       -10.21          -1
  7       96       6        -1.21           0
  8       97       6        -0.21           0
  9      109       7        11.79           1
 10      119       8        21.79           2
 11      149       9        51.79           3
 12      145       9        47.79           3
 13      154      10        56.79           4
 14      166      10        68.79           4

mean     97.21     6                       =AVERAGE(C$2:C$15)
stdev    46.22     2.96                    =STDEV(C$2:C$15)
count    14        14                      =COUNT(C$2:C$15)
correl   0.9937                            =CORREL(B$2:B$15,C$2:C$15)
covar    126.29                            =COVAR(B$2:B$15,C$2:C$15)
correl   0.9937    Book method             =SUMPRODUCT(E2:E15,F2:F15)/SQRT(SUMPRODUCT(E2:E15,E2:E15)*SUMPRODUCT(F2:F15,F2:F15))
covar    136       B6014 method            =B20*(B18*C18)
covar    136       Book method             =SUMPRODUCT(E2:E15,F2:F15)/(B19-1)
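
The two covariance figures differ only in the divisor: Excel's COVAR function divides by n, while the other two methods divide by n − 1. The same calculations in Python (a sketch, not part of the slides):

import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)
n = len(units)

dev_m, dev_u = minutes - minutes.mean(), units - units.mean()

cov_population = np.sum(dev_m * dev_u) / n        # Excel COVAR: about 126.29
cov_sample     = np.sum(dev_m * dev_u) / (n - 1)  # divide by n - 1: about 136
r = np.sum(dev_m * dev_u) / np.sqrt(np.sum(dev_m**2) * np.sum(dev_u**2))  # about 0.9937
print(cov_population, cov_sample, r)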
Applied Regression -- Prof. Juran 23
Graphical Analysis
We should always plot:
histograms of the y and x values,
a time order plot of x and y (if appropriate), and
a scatter plot of y on x.
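
A sketch of these displays for the repair data, using matplotlib (not part of the slides; the time order plot is omitted because the data carry no time information):

import numpy as np
import matplotlib.pyplot as plt

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(minutes, bins=8)
axes[0].set(title="Minutes", xlabel="Minutes", ylabel="Frequency")

axes[1].hist(units, bins=10)
axes[1].set(title="Units", xlabel="Units", ylabel="Frequency")

axes[2].scatter(units, minutes)
axes[2].set(title="Minutes vs. Units", xlabel="Units", ylabel="Minutes")

plt.tight_layout()
plt.show()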
Applied Regression -- Prof. Juran 24
[Histogram of Minutes: Frequency on the vertical axis, Minutes from 0 to 200 on the horizontal axis]
Applied Regression -- Prof. Juran 25
[Histogram of Units: Frequency on the vertical axis, Units from 0 to 11 on the horizontal axis]
Applied Regression -- Prof. Juran 26
[Scatter plot of Minutes vs. Units: Minutes (0 to 180) on the vertical axis, Units (0 to 12) on the horizontal axis]
Applied Regression -- Prof. Juran 27
Estimating Parameters
Using Excel
Using Solver
Using analytical formulas
Applied Regression -- Prof. Juran 28
Using Excel (Scatter Diagram)
Applied Regression -- Prof. Juran 29
[Scatter plot of Minutes vs. Units with an Excel trendline: y = 15.509x + 4.1617, R² = 0.9874]
Applied Regression -- Prof. Juran 30
Using Excel (Data Analysis)
Data tab → Data Analysis
Applied Regression -- Prof. Juran 31
Using Excel (Data Analysis)
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.9937
R Square              0.9874
Adjusted R Square     0.9864
Standard Error        5.3917
Observations          14

ANOVA
              df    SS            MS            F          Significance F
Regression     1    27419.5088    27419.5088    943.2009   0.0000
Residual      12      348.8484       29.0707
Total         13    27768.3571

             Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept    4.1617         3.3551            1.2404    0.2385     -3.1485     11.4718
Units        15.5088        0.5050           30.7116    0.0000     14.4085     16.6090
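
The same summary output can be reproduced outside Excel; a sketch using Python's statsmodels (not part of the slides), which should give the same coefficients, R-square, and ANOVA figures:

import numpy as np
import statsmodels.api as sm

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

X = sm.add_constant(units)      # adds the intercept column
model = sm.OLS(minutes, X).fit()
print(model.summary())          # intercept ~ 4.16, slope ~ 15.51, R-square ~ 0.987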
Applied Regression -- Prof. Juran 32
Using Solver
Obs   Minutes   Units   Predictions   Errors     Errors^2
  1       23       1       19.6704     3.3296     11.0861
  2       29       2       35.1792    -6.1792     38.1824
  3       49       3       50.6880    -1.6880      2.8492
  4       64       4       66.1967    -2.1967      4.8256
  5       74       4       66.1967     7.8033     60.8909
  6       87       5       81.7055     5.2945     28.0317
  7       96       6       97.2143    -1.2143      1.4745
  8       97       6       97.2143    -0.2143      0.0459
  9      109       7      112.7230    -3.7230     13.8611
 10      119       8      128.2318    -9.2318     85.2265
 11      149       9      143.7406     5.2594     27.6614
 12      145       9      143.7406     1.2594      1.5861
 13      154      10      159.2494    -5.2494     27.5558
 14      166      10      159.2494     6.7506     45.5712
                                      Sum:       348.8484

Intercept    4.1617
Slope       15.5088

Cell formulas:
Predictions:      =$B$17+$B$18*C3
Errors:           =B5-E5
Errors^2:         =F7^2
Sum of Errors^2:  =SUM(G2:G15)
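
What Solver does here, numerically minimizing the sum of squared errors over the intercept and slope, can be mimicked with scipy; a sketch in Python (not part of the slides):

import numpy as np
from scipy.optimize import minimize

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

def sse(params):
    b0, b1 = params
    errors = minutes - (b0 + b1 * units)
    return np.sum(errors ** 2)

result = minimize(sse, x0=[0.0, 0.0])   # start from arbitrary values, as Solver does
print(result.x, result.fun)             # about (4.1617, 15.5088) and SSE ~ 348.85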
Applied Regression -- Prof. Juran 33
Applied Regression -- Prof. Juran 34
[The same worksheet after running Solver: Intercept = 4.1617, Slope = 15.5088, sum of squared errors = 348.8484, matching the values above.]
Applied Regression -- Prof. Juran 35
Using Formulas
β̂₁ = Σᵢ (yᵢ − ȳ)(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²        (RABE Eq. 2.13)

β̂₀ = ȳ − β̂₁ x̄                                  (RABE Eq. 2.14)
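
The same two formulas in code (a sketch in Python, not part of the slides; it should reproduce the Solver and Data Analysis results):

import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

dev_x = units - units.mean()
dev_y = minutes - minutes.mean()

b1 = np.sum(dev_y * dev_x) / np.sum(dev_x ** 2)   # Eq. 2.13: slope, about 15.5088
b0 = minutes.mean() - b1 * units.mean()           # Eq. 2.14: intercept, about 4.1617
print(b0, b1)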
Applied Regression -- Prof. Juran 36
Applying these formulas in the worksheet, using the error columns from earlier (Error (min) = Minutes − 97.2143, Error (units) = Units − 6):

Slope       15.50877    Eq. 2.13    =SUMPRODUCT(E2:E15,F2:F15)/(SUMPRODUCT(F2:F15,F2:F15))
Intercept    4.161654   Eq. 2.14    =B18-F18*C18
Applied Regression -- Prof. Juran 37
Correlation and Regression
There is a close relationship between regression and correlation. The correlation coefficient, ρ, measures the degree to which random variables X and Y move together or not.
ρ = +1 implies a perfect positive linear relationship, while ρ = −1 implies a perfect negative linear relationship. ρ = 0 essentially implies independence.
Applied Regression -- Prof. Juran 38
Statistical Basics: Covariance
The covariance can be calculated using

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

or equivalently

Cov(X, Y) = E(XY) − μ_X μ_Y

Usually, we find it more useful to consider the coefficient of correlation. That is,

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)

Sometimes the inverse relation is useful:

Cov(X, Y) = σ_X σ_Y Corr(X, Y)
Applied Regression -- Prof. Juran 39
Correlation and Regression
The sample (Pearson) correlation coefficient is

r_{X,Y} = Σᵢ (yᵢ − ȳ)(xᵢ − x̄) / √[ Σᵢ (yᵢ − ȳ)² · Σᵢ (xᵢ − x̄)² ]

−1 ≤ Cov(X, Y) / (σ_X σ_Y) ≤ +1

Regressions automatically produce an estimate of the squared correlation, called R² or R-square. Values of R-square close to 1 indicate a strong relationship, while values close to 0 indicate a weak or non-existent relationship.
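
For simple linear regression, R-square is just the squared sample correlation; a quick check in Python (a sketch, reusing the repair data):

import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166], float)
units   = np.array([ 1,  2,  3,  4,  4,  5,  6,  6,   7,   8,   9,   9,  10,  10], float)

r = np.corrcoef(units, minutes)[0, 1]
print(r, r ** 2)   # about 0.9937 and 0.9874, matching Multiple R and R Square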
Applied Regression -- Prof. Juran 40
Some Validity Issues
We need to evaluate the strength of the relationship,
whether we have the proper functional form, and the
validity of the several statistical assumptions from a
practical and theoretical viewpoint using a multiplicity
of tools.
Fitted regression functions are interpolations of the data
in hand, and extrapolation is always dangerous.
Moreover, the functional form that fits the data in our
range of experience may not fit beyond it.
Applied Regression -- Prof. Juran 41
Regressions are based on past data. Why should the
same functional form and parameters hold in the
future?
In some uses of regression the future value of x may not be known; this adds greatly to our uncertainty.
In collecting data to do a regression, choose x values wisely when you have a choice. They should:
Be in the range where you intend to work
Be spread out along the range, with some observations near practical extremes
Have replicated values at the same x, or at very nearby x values, for good estimation of σ
Whenever possible, test the stability of your model with a holdout sample not used in the original model fitting.
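
A sketch of the holdout idea in Python (invented data and an invented 45/15 split; in practice the holdout should be set aside before any fitting):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 60)
y = 4 + 15 * x + rng.normal(0, 5, 60)      # hypothetical data

idx = rng.permutation(60)
train, hold = idx[:45], idx[45:]           # fit on 45 points, hold out 15

b1, b0 = np.polyfit(x[train], y[train], 1) # polyfit returns [slope, intercept]
holdout_errors = y[hold] - (b0 + b1 * x[hold])
print("holdout RMSE:", np.sqrt(np.mean(holdout_errors ** 2)))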

Applied Regression -- Prof. Juran 42
Summary
Course Objectives & Description
Review of Basic Statistical Ideas
Intercept, Slope, Correlation, Causality
Simple Linear Regression
Statistical Model and Concepts
Regression in Excel
Computer Repair Example
