
Correlation and Regression


 Correlation involves calculating
an index to measure the nature
of the relationship between
variables.
 With regression, an equation is
developed to predict the values
of a dependent variable.
Pearson Product Moment Coefficient r

 The Pearson correlation coefficient r varies over a range of +1 through 0 to −1.
 It symbolizes the coefficient's estimate of linear association based on the sample data; the population correlation is represented by ρ.

Pearson Product Moment Coefficient r

Correlation
coefficients reveal the
magnitude and
direction of
relationships.
Illustration of Direction:

 Positive Correlation
Family income vs. household food
expenditures
 Negative Correlation
Prices of products and services in
relation to their scarcity or
availability.

SCATTERPLOTS
 They are essential for
understanding the relationship
between variables.
 They provide a means for visual
inspection of data that a list of
values for two variables cannot.
Correlation Analysis
 Used to measure and interpret the strength of association (linear relationship) between two numerical variables
- Only concerned with the strength of the relationship
- No causal effect is implied


Scatter Diagram
 A plot of paired data to determine or show a relationship between two variables.

Paired Data
 When there appears to be a linear relationship between x and y, attempt to fit a line to the scatter diagram.
Linear Correlation
 The general trend of the points seems to follow a straight-line segment.

[Figures: scatter diagrams illustrating linear correlation, non-linear correlation, and no linear correlation]

High Linear Correlation
 Points lie close to a straight line.

[Figures: scatter diagrams illustrating high, moderate, low, and perfect linear correlation]

The Sample Correlation Coefficient, r
 A measurement of the strength of the linear association between two variables.
 Also called the Pearson product-moment correlation coefficient.
Positive Linear Correlation
 High values of x are paired with high values of y and low values of x are paired with low values of y.

Negative Linear Correlation
 High values of x are paired with low values of y and low values of x are paired with high values of y.

Little or No Linear Correlation
 Both high and low values of x are sometimes paired with high values of y and sometimes with low values of y.

[Figures: scatter diagrams of positive correlation, negative correlation, and little or no linear correlation]
What type of correlation is expected?
 Height and weight
 Mileage on tires and remaining tread
 IQ and height
 Years of driving experience and insurance rates

Linear correlation coefficient
−1 ≤ r ≤ +1
Table of Interpretation
Pearson r Qualitative Interpretation
1.00 Perfect Correlation
0.91 - 0.99 Very High Correlation
0.71 - 0.90 High Correlation
0.41 - 0.70 Marked Correlation
0.21 - 0.40 Slight/Low Correlation
0 - 0.20 Negligible Correlation

If r = 0, the scatter diagram might look like: [figure]

If r = +1, all points lie on the least squares line: [figure]

If r = −1, all points lie on the least squares line: [figure]

[Figures: scatter diagrams for −1 < r < 0 and for 0 < r < 1]
Find the Correlation Coefficient

x (Miles)   y (Min.)   x²     y²      xy
2           6          4      36      12
5           9          25     81      45
12          23         144    529     276
7           18         49     324     126
7           15         49     225     105
15          28         225    784     420
10          19         100    361     190
Σx = 58     Σy = 118   Σx² = 596   Σy² = 2340   Σxy = 1174

The Correlation Coefficient
r = 0.9753643 ≈ 0.98
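The table's column sums can be plugged into the computational formula for Pearson's r. A minimal sketch (not part of the slides) that reproduces the result above:

```python
# Pearson's r via the computational formula:
# r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²))
import math

x = [2, 5, 12, 7, 7, 15, 10]    # miles
y = [6, 9, 23, 18, 15, 28, 19]  # minutes

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))  # ≈ 0.9754, i.e. r ≈ 0.98
```

The intermediate sums (Σx = 58, Σxy = 1174, …) match the table, so the code doubles as a check on the hand calculation.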
Warning
 The correlation coefficient (r) measures the strength of the relationship between two variables.
 Just because two variables are related does not imply that there is a cause-and-effect relationship between them.

Testing the Correlation Coefficient
 Determining whether a value of the sample correlation coefficient, r, is far enough from zero to indicate correlation in the population.
The Population Correlation Coefficient
 ρ = Greek letter rho

Hypotheses to Test Rho
 Assume that both variables x and y are normally distributed.
 To test if the (x, y) values are correlated in the population, set up the null hypothesis that they are not correlated:

H0: x and y are not correlated, so ρ = 0.
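One common way to carry out this test (an assumption here, since the slides do not show the statistic) is the t statistic t = r√(n−2)/√(1−r²) with n−2 degrees of freedom. A sketch using the miles/minutes example:

```python
# Testing H0: rho = 0 against H1: rho != 0 with
# t = r*sqrt(n-2)/sqrt(1-r^2), df = n - 2  (assumed standard approach)
import math

r = 0.9753643  # sample correlation from the miles/minutes example
n = 7

t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
t_crit = 2.5706  # two-tailed critical value from a t table, alpha = 0.05, df = 5

print(round(t, 1), abs(t) > t_crit)  # ≈ 9.9, True: reject H0
```

Since |t| far exceeds the critical value, the sample correlation is significantly different from zero at the 5% level.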


Spearman Rank Correlation
A measure of Rank Correlation

The Spearman Correlation
 Spearman's correlation is designed to measure the relationship between variables measured on an ordinal scale of measurement.
 Similar to Pearson's correlation; however, it uses ranks as opposed to actual values.

Assumptions
 The data is a bivariate random variable.

 The measurement scale is at least ordinal.


Advantages
1. Less sensitive to bias due to the effect of outliers
   - Can be used to reduce the weight of outliers (large distances get treated as a one-rank difference)
2. Does not require the assumption of normality.
3. When the intervals between data points are problematic, it is advisable to study the rankings rather than the actual values.

Disadvantages
1. Calculations may become tedious. Additionally, ties are important and must be factored into the computation.

Steps in Calculating Spearman's Rho
1. Convert the observed values to ranks (accounting for ties).
2. Find the difference between the ranks, square them, and sum the squared differences.
3. Set up the hypothesis, carry out the test, and conclude based on the findings.

Steps in Calculating Spearman's Rho
4. If the null is rejected, then calculate the Spearman correlation coefficient to measure the strength of the relationship between the variables.
Hypotheses: I
A. (Two-Tailed)
H0: There is no correlation between the Xs and the Ys. (There is mutual independence between the Xs and the Ys.)
H1: There is a correlation between the Xs and the Ys. (There is mutual dependence between the Xs and the Ys.)

Spearman's Rho
 Assumes values between −1 (perfectly negative correlation) and +1 (perfectly positive correlation).
Example 1
The ICC rankings for One Day International (ODI) and
Test matches for nine teams are shown below.
Team Test Rank ODI Rank
Australia 1 1
India 2 3
South Africa 3 2
Sri Lanka 4 7
England 5 6
Pakistan 6 4
New Zealand 7 5
West Indies 8 8
Bangladesh 9 9

Test whether there is correlation between the ranks

Example 1
Team Test Rank ODI Rank d d2
Australia 1 1 0 0
India 2 3 1 1
South Africa 3 2 1 1
Sri Lanka 4 7 3 9
England 5 6 1 1
Pakistan 6 4 2 4
New Zealand 7 5 2 4
West Indies 8 8 0 0
Bangladesh 9 9 0 0
Total 20

Answer:
T = Σdᵢ² = 20
ρs = 1 − 6T / (n(n² − 1)) = 1 − 6(20) / (9(81 − 1)) = 0.8333
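The calculation above can be sketched in a few lines. This reproduces the slide's Σd² = 20 and ρs ≈ 0.8333 using the no-ties formula ρs = 1 − 6Σd²/(n(n² − 1)):

```python
# Spearman's rho from squared rank differences (valid when there are no ties)
test_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # ICC Test rankings
odi_rank  = [1, 3, 2, 7, 6, 4, 5, 8, 9]  # ICC ODI rankings

n = len(test_rank)
sum_d2 = sum((a - b) ** 2 for a, b in zip(test_rank, odi_rank))

rho = 1 - 6 * sum_d2 / (n * (n * n - 1))
print(sum_d2, round(rho, 4))  # 20 and 0.8333, matching the worked example
```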
Example 2
A composite rating is given by executives to
each college graduate joining a plastic
manufacturing firm. The executive ratings
represent the future potential of the college
graduate. The graduates then enter an in-plant
training programme and are given another
composite rating. The executive ratings and the
in-plant ratings are as follows:

Graduate Executive rating (X) Training rating (Y)


A 8 4
B 10 4
C 9 4
D 4 3
E 12 6
F 11 9
G 11 9
H 7 6
I 8 6
J 13 9
K 10 5
L 12 9

A) At the 5% level of significance, determine if there is a positive correlation between the variables.
B) Find the rank correlation coefficient if the null is rejected.
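Unlike Example 1, this data set has many ties, so the simple Σd² formula is not exact. A sketch of part B under the assumption that ties receive average ranks and the tie-corrected rho is computed as the Pearson correlation of the ranks:

```python
# Tie-corrected Spearman rho: average ranks for ties, then Pearson on the ranks
import math

def avg_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied 1-based positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

executive = [8, 10, 9, 4, 12, 11, 11, 7, 8, 13, 10, 12]  # X, graduates A..L
training  = [4, 4, 4, 3, 6, 9, 9, 6, 6, 9, 5, 9]          # Y, graduates A..L

rho = pearson(avg_ranks(executive), avg_ranks(training))
print(round(rho, 2))  # ≈ 0.71: a marked positive rank correlation
```

The hypothesis test in part A would compare this statistic (or the rank-sum version of it) against the appropriate critical value for a one-tailed test at the 5% level.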
Regression Analysis

Purpose of Regression Analysis
 Regression analysis is used primarily to establish a linear relationship between variables and provide prediction.
- Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable.
- Explains the relationship of the independent variables to the dependent variable.
Types of Regression Models
[Figures: positive linear relationship, negative linear relationship, relationship not linear, no relationship]

Simple Linear Regression
 The relationship between variables is described by a linear function.
 This function relates how much change in the dependent variable is associated with a unit increase (or decrease) in the independent variable.
Population Linear Regression: Simple Linear Regression Model
The population regression line is a straight line that describes the relationship of the average value of one variable to the other:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent (response) variable, Xᵢ is the independent (explanatory) variable, β₀ is the population Y-intercept, β₁ is the population slope coefficient, and εᵢ is the random error. The population regression line is μY|X = β₀ + β₁Xᵢ.

Random Error Term
 εᵢ is the random error term for the ith observation, where the εᵢ are independently normally distributed with mean 0 and variance σ², for i = 1, …, n, where n is the number of observations.
Random Error Term
 It represents the effect of other factors, apart
from X, which are omitted from the model
but do affect the response variable to some
extent

 It may also account for errors of observation


or measurements in recording the response
variable


4 Assumptions Made on the Random Error Term
1. The error terms are independent from one another;
2. The error terms are normally distributed;
3. The error terms all have a mean of 0; and
4. The error terms have constant variance, σ².
Population Linear Regression: Simple Linear Regression Model
[Figure: the observed value of Y is Yᵢ = β₀ + β₁Xᵢ + εᵢ; the conditional mean is μY|X = β₀ + β₁Xᵢ; the random error εᵢ is the vertical distance between the observed value of Y and the line]

Interpretation of the Slope and the Intercept
 β₀ = E(Y | X = 0) is the average value of Y when the value of X is zero.
 β₁ = ΔE(Y | X) / ΔX measures the change in the average value of Y as a result of a one-unit change in X.
Steps in Doing a Simple Linear
Regression Analysis
1. Obtain the equation that best fits the data;
2. Evaluate the equation to determine the strength
of the relationship for estimation and prediction;
3. Determine if the assumptions on the error terms
are satisfied and if model fits the data adequately;
4. Use the equation for prediction and description.


Sample Linear Regression
The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:

Yᵢ = b₀ + b₁Xᵢ + eᵢ

where b₀ is the sample Y-intercept, b₁ is the sample slope coefficient, and eᵢ is the residual. The sample regression line (fitted regression line) gives the predicted value: Ŷ = b₀ + b₁X.
Estimation using Method of Least Squares
 The estimates for the parameters β₀ and β₁ are obtained by minimizing the sum of the squared errors:

Σᵢ (Yᵢ − μY|Xᵢ)² = Σᵢ εᵢ², summed over i = 1, …, n

 b₀ provides an estimate of β₀
 b₁ provides an estimate of β₁

Sample Linear Regression
 As a result of LS estimation, the values of b₀ and b₁ also minimize the sum of the squared residuals:

Σᵢ (Yᵢ − Ŷᵢ)² = Σᵢ eᵢ², summed over i = 1, …, n
Sample Linear Regression
[Figure: the population line μY|X = β₀ + β₁Xᵢ (intercept β₀, slope β₁) and the fitted line Ŷᵢ = b₀ + b₁Xᵢ (intercept b₀, slope b₁); the residual eᵢ is the vertical distance from an observed value Yᵢ to the fitted line]

Interpretation of the Slope and the Intercept
 b₀ = Ê(Y | X = 0) is the estimated average value of Y when the value of X is zero.
 b₁ = ΔÊ(Y | X) / ΔX is the estimated change in the average value of Y as a result of a one-unit change in X.
EXAMPLE
Examine the linear relationship of the annual sales of produce stores on their size in square footage. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

EXAMPLE
Ŷᵢ = b₀ + b₁Xᵢ = 1636.415 + 1.487Xᵢ

From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
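The Excel coefficients can be reproduced directly with the least-squares formulas b₁ = Sxy/Sxx and b₀ = ȳ − b₁x̄. A minimal sketch (not the slides' Excel steps):

```python
# Least-squares estimates for the produce-store data:
# b1 = Sxy/Sxx, b0 = mean(y) - b1*mean(x)
sqft  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # annual sales, $1000s

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(sales) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, sales))
sxx = sum((x - mean_x) ** 2 for x in sqft)

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
print(round(b0, 3), round(b1, 4))  # ≈ 1636.415 and 1.4866, as in the printout
```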
EXAMPLE
[Figure: scatterplot of annual sales ($000) against square feet, with the fitted line Ŷᵢ = 1636.415 + 1.487Xᵢ]

EXAMPLE
Ŷᵢ = 1636.415 + 1.487Xᵢ
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.

The model estimates that for each increase of one square foot in the size of the store, expected annual sales are predicted to increase by $1,487.
RESIDUAL ANALYSIS
 Purposes
- Examine linearity
- Evaluate assumptions to see if any is violated
 Graphical Analysis of Residuals
- Plot residuals vs. Xᵢ, Ŷᵢ (and time if necessary)

Residual Analysis for Linearity
[Figures: Y vs. X scatterplots and residual (e) vs. X plots contrasting a non-linear pattern with a linear one]
Residual Analysis for Homoscedasticity
[Figures: Y vs. X scatterplots and standardized residual (SR) vs. X plots contrasting heteroscedasticity with homoscedasticity]

Residual Analysis: Excel Output

Observation   Predicted Y    Residuals
1             4202.344417    −521.3444173
2             3928.803824    −533.8038245
3             5822.775103    830.2248971
4             9894.664688    −351.6646882
5             3557.14541     −239.1454103
6             4918.90184     644.0981603
7             3588.364717    171.6352829

[Figure: residual plot against square feet]
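The residuals in the table are just observed minus predicted values. A short sketch that recomputes them from the fitted coefficients (and checks the standard property that least-squares residuals sum to essentially zero):

```python
# Predicted values and residuals from the fitted line Y-hat = b0 + b1*X
b0, b1 = 1636.414726, 1.486633657  # coefficients from the Excel printout

sqft  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

predicted = [b0 + b1 * x for x in sqft]
residuals = [y - yhat for y, yhat in zip(sales, predicted)]

print(round(residuals[0], 2))      # ≈ -521.34, matching observation 1
print(round(sum(residuals), 3))    # ≈ 0, as least-squares residuals must be
```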


Inference about the Slope: t Test
 t test for a population slope: is there a linear relationship of Y on X?
 Null and alternative hypotheses:
H0: β₁ = 0 (no linear relationship)
H1: β₁ ≠ 0 (linear relationship)
 Test statistic:

t = b₁ / Sb₁, where Sb₁ = √( MSE / (ΣXᵢ² − (ΣXᵢ)²/n) )

and MSE = SSE / (n − 2) = Σ(Yᵢ − Ŷᵢ)² / (n − 2)

Example: Produce Store
Data for Seven Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Estimated Regression Equation: Ŷᵢ = 1636.415 + 1.487Xᵢ
The slope of this model is 1.487. Is square footage of the store affecting its annual sales?
Inferences about the Slope: t-Test
H0: β₁ = 0    H1: β₁ ≠ 0    α = .05    df = 7 − 2 = 5    Critical values: ±2.5706

From Excel Printout:
           Coefficients   Standard Error   t Stat   P-value
Intercept  1636.4147      451.4953         3.6244   0.01515
Footage    1.4866         0.1650           9.0099   0.00028

Test statistic: t = b₁ / Sb₁ = 1.4866 / 0.1650 = 9.0099
Decision: Reject H0, since 9.0099 > 2.5706.
Conclusion: There is evidence that square footage affects annual sales.
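The standard error and t statistic in the printout follow from the formulas on the previous slide, Sb₁ = √(MSE/Sxx) and t = b₁/Sb₁. A sketch that reproduces them from the raw data:

```python
# t test for the slope: fit by least squares, then Sb1 = sqrt(MSE/Sxx), t = b1/Sb1
import math

sqft  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, sales)) / sxx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, sales))
mse = sse / (n - 2)            # df = n - 2 for simple linear regression
s_b1 = math.sqrt(mse / sxx)

t = b1 / s_b1
print(round(s_b1, 4), round(t, 2))  # ≈ 0.165 and 9.01; |t| > 2.5706, so reject H0
```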

Pitfalls of Regression Analysis
 Lacking an awareness of the assumptions underlying least-squares regression
 Not knowing how to evaluate the assumptions
 Not knowing the alternatives to classical regression if some assumption is violated
 Using a regression model without knowledge of the subject matter
Strategies for Avoiding the Pitfalls of Regression
 Start with a scatter plot of Y against X to observe possible relationships.
 Perform residual analysis to check the assumptions.
- Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality.

Strategies for Avoiding the Pitfalls of Regression
 If there is violation of any assumption, use alternative methods to least-squares regression or alternative least-squares models (e.g., curvilinear or multiple regression).
 If there is no evidence of assumption violation, then test for the significance of the regression coefficients.
Problem Set
c) Find the equation of the regression model that will predict cost per day given the length of services in days.