
# Correlation

## Correlation and Regression

- Correlation involves calculating an index to measure the nature of the relationship between variables.
- With regression, an equation is developed to predict the values of a dependent variable.

## Pearson Product Moment Coefficient r

- The Pearson correlation coefficient r varies over a range of $+1$ through $0$ to $-1$.
- It is the estimate of linear association based on the sample data. The coefficient $\rho$ represents the population correlation.
- Correlation coefficients reveal the magnitude and direction of relationships.

Illustration of Direction:

- Positive correlation: family income vs. household food expenditures.
- Negative correlation: prices of products and services in relation to their availability (prices tend to fall as availability rises).

## Scatterplots

- They are essential for understanding the relationship between variables.
- They provide a means for visual inspection of data that a list of values for two variables cannot.

## Correlation Analysis

- Used to measure and interpret the strength of association (linear relationship) between two numerical variables.
  - Only concerned with the strength of the relationship.
  - No causal effect is implied.

## Scatter Diagram

A plot of paired data to determine or show a relationship between two variables.

## Paired Data

When there appears to be a linear relationship between x and y, attempt to fit a line to the scatter diagram.

## Linear Correlation

The general trend of the points seems to follow a straight line segment.

[Figures: scatter diagrams illustrating linear correlation, non-linear correlation, no linear correlation, high linear correlation, moderate linear correlation, and perfect linear correlation.]

## Correlation Coefficient, r

- A measurement of the strength of the linear association between two variables.
- Also called the Pearson product-moment correlation coefficient.

## Positive Linear Correlation

High values of x are paired with high values of y, and low values of x are paired with low values of y.

## Negative Linear Correlation

High values of x are paired with low values of y, and low values of x are paired with high values of y.

## Little or No Linear Correlation

Both high and low values of x are sometimes paired with high values of y and sometimes with low values of y.

[Figures: scatter diagrams of y against x showing positive correlation, negative correlation, and little or no linear correlation.]

## What type of correlation is expected?

- Height and weight
- IQ and height
- Years of driving experience and insurance rates

## Linear correlation coefficient

$$-1 \le r \le +1$$

## Table of Interpretation

| Pearson r (magnitude) | Qualitative Interpretation |
|---|---|
| 1.00 | Perfect Correlation |
| 0.91 - 0.99 | Very High Correlation |
| 0.71 - 0.90 | High Correlation |
| 0.41 - 0.70 | Marked Correlation |
| 0.21 - 0.40 | Slight/Low Correlation |
| 0 - 0.20 | Negligible Correlation |

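- If $r = 0$, the scatter diagram might look like: [Figure: scatter diagram of y against x for $r = 0$.]
- If $r = +1$, all points lie on the least squares line. [Figure: scatter diagram of y against x for $r = +1$.]
- If $r = -1$, all points lie on the least squares line. [Figure: scatter diagram of y against x for $r = -1$.]

[Figures: scatter diagrams for $-1 < r < 0$ and for $0 < r < 1$.]
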
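## Find the Correlation Coefficient

| x (Miles) | y (Min.) | x² | y² | xy |
|---|---|---|---|---|
| 2 | 6 | 4 | 36 | 12 |
| 5 | 9 | 25 | 81 | 45 |
| 12 | 23 | 144 | 529 | 276 |
| 7 | 18 | 49 | 324 | 126 |
| 7 | 15 | 49 | 225 | 105 |
| 15 | 28 | 225 | 784 | 420 |
| 10 | 19 | 100 | 361 | 190 |
| Σx = 58 | Σy = 118 | Σx² = 596 | Σy² = 2340 | Σxy = 1174 |

The correlation coefficient, computed from the column sums with $n = 7$:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = 0.9753643$$

$$r \approx 0.98$$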
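
As a quick check of the arithmetic, the following minimal Python sketch recomputes r for the miles/minutes data from the column sums. The function name `pearson_r` is only illustrative and is not part of the slides; the formula is the usual computational form of the Pearson coefficient.

```python
import math

# Paired data from the table: x = miles, y = minutes.
x = [2, 5, 12, 7, 7, 15, 10]
y = [6, 9, 23, 18, 15, 28, 19]


def pearson_r(xs, ys):
    """Pearson r from the sums of x, y, x^2, y^2 and xy (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(round(r, 2))  # the slide reports r = 0.9753643, i.e. about 0.98
```
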
## Warning

- The correlation coefficient (r) measures the strength of the linear relationship between two variables.
- Just because two variables are related does not imply that there is a cause-and-effect relationship between them.

## Testing the Correlation Coefficient

Determining whether a value of the sample correlation coefficient, r, is far enough from zero to indicate correlation in the population.

## The Population Correlation Coefficient

$\rho$ = Greek letter "rho"

## Hypotheses to Test Rho

- Assume that both variables x and y are normally distributed.
- To test if the (x, y) values are correlated in the population, set up the null hypothesis that they are not correlated:

$H_0$: x and y are not correlated, so $\rho = 0$.
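
The slides do not show the test statistic itself. One common choice, assumed here, is $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom. The sketch below applies it to the miles/minutes example; the helper name and the use of the critical value 2.5706 (the two-tailed 5% value for 5 degrees of freedom quoted in the slope-test example) are assumptions for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0 (assumed form, n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r_sample = 0.9753643  # sample r from the miles/minutes example
n_pairs = 7           # number of (x, y) pairs in that example
t_stat = correlation_t_statistic(r_sample, n_pairs)

# Assumed two-tailed 5% critical value for df = 5.
t_critical = 2.5706
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), reject_h0)
```

Under these assumptions the statistic is far beyond the critical value, so the null hypothesis of no correlation would be rejected for this sample.
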

## Spearman Rank Correlation

A measure of rank correlation.

## The Spearman Correlation

- Spearman's correlation is designed to measure the relationship between variables measured on an ordinal scale of measurement.
- It is similar to Pearson's correlation; however, it uses ranks as opposed to actual values.
## Assumptions

- The data is a bivariate random variable.

## Advantages

1. Less sensitive to bias due to the effect of outliers.
   - Can be used to reduce the weight of outliers (large distances get treated as a one-rank difference).

3. When the intervals between data points are problematic, it is advisable to study the rankings rather than the actual values.

## Disadvantages

1. Calculations may become tedious. Additionally, ties are important and must be factored into the computation.

## Steps in Calculating Spearman's Rho

1. Convert the observed values to ranks (accounting for ties).
2. Find the differences between the ranks, square them, and sum the squared differences.
3. Set up the hypotheses, carry out the test, and conclude based on the findings.
4. If the null hypothesis is rejected, calculate the Spearman correlation coefficient to measure the strength of the relationship between the variables.

## Hypothesis: I

A. (Two-Tailed)

- $H_0$: There is no correlation between the Xs and the Ys (there is mutual independence between the Xs and the Ys).
- $H_1$: There is a correlation between the Xs and the Ys (there is mutual dependence between the Xs and the Ys).

## Spearman's Rho

Assumes values between $-1$ and $+1$.

[Scale: $-1$ (perfectly negative correlation), $0$, $+1$ (perfectly positive correlation).]

## Example 1

The ICC rankings for One Day International (ODI) and Test matches for nine teams are shown below. Test whether there is correlation between the ranks.

| Team | Test Rank | ODI Rank | d (absolute difference) | d² |
|---|---|---|---|---|
| Australia | 1 | 1 | 0 | 0 |
| India | 2 | 3 | 1 | 1 |
| South Africa | 3 | 2 | 1 | 1 |
| Sri Lanka | 4 | 7 | 3 | 9 |
| England | 5 | 6 | 1 | 1 |
| Pakistan | 6 | 4 | 2 | 4 |
| New Zealand | 7 | 5 | 2 | 4 |
| West Indies | 8 | 8 | 0 | 0 |
| Bangladesh | 9 | 9 | 0 | 0 |
| Total | | | | 20 |

$$T = \sum d_i^2 = 20$$

$$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(20)}{9(9^2 - 1)} = 0.8333$$
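
The computation in Example 1 can be sketched in a few lines of Python. This is a minimal illustration that assumes there are no tied ranks, so the shortcut formula $1 - 6\sum d^2 / (n(n^2-1))$ applies; the function name is hypothetical.

```python
# Test and ODI ranks for the nine teams in Example 1, in table order.
test_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
odi_rank = [1, 3, 2, 7, 6, 4, 5, 8, 9]


def spearman_rho_no_ties(rank_x, rank_y):
    """Return (sum of squared rank differences, Spearman's rho); valid without ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return sum_d2, rho


T, rho = spearman_rho_no_ties(test_rank, odi_rank)
print(T, round(rho, 4))  # 20 0.8333
```
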
## Example 2

A composite rating is given by executives to each college graduate joining a plastic manufacturing firm. The executive ratings represent the future potential of the college graduate. The graduates then enter an in-plant training programme and are given another composite rating. The executive ratings and the in-plant ratings are as follows:

| Graduate | Executive rating | In-plant rating |
|---|---|---|
| A | 8 | 4 |
| B | 10 | 4 |
| C | 9 | 4 |
| D | 4 | 3 |
| E | 12 | 6 |
| F | 11 | 9 |
| G | 11 | 9 |
| H | 7 | 6 |
| I | 8 | 6 |
| J | 13 | 9 |
| K | 10 | 5 |
| L | 12 | 9 |

A) At the 5% level of significance, determine if there is a positive correlation between the variables.

B) Find the rank correlation coefficient if the null is rejected.
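
Example 2 contains tied ratings, so the ranks have to account for ties. The sketch below is one way to do that, not the method prescribed by the slides: tied values share the average of the ranks they occupy, and the coefficient is then computed as the Pearson correlation of the two rank lists. It assumes the first column of ratings is the executive rating and the second is the in-plant rating; the function names are illustrative.

```python
import math

# Ratings for graduates A..L (column order assumed: executive, then in-plant).
executive = [8, 10, 9, 4, 12, 11, 11, 7, 8, 13, 10, 12]
in_plant = [4, 4, 4, 3, 6, 9, 9, 6, 6, 9, 5, 9]


def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


executive_ranks = average_ranks(executive)
rho_s = spearman_rho(executive, in_plant)
print(executive_ranks)
print(round(rho_s, 3))
```

The significance test in part A still requires a critical value from a Spearman table (or an approximation), which is not reproduced here.
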
# Regression Analysis

## Purpose of Regression Analysis

- Regression analysis is used primarily to establish a linear relationship between variables and to provide predictions.
  - Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable.
  - Explains the effect of the independent variables on the dependent variable.

## Types of Regression Models

[Figures: four scatter plots illustrating a positive linear relationship, a relationship that is not linear, a negative linear relationship, and no relationship.]

## Simple Linear Regression

- The relationship between variables is described by a linear function.
- This function relates how much change in the dependent variable is associated with a unit increase (or decrease) in the independent variable.

## Population Linear Regression: Simple Linear Regression Model

The population regression line is a straight line that describes the dependence of the average value of one variable on the other.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

- $Y_i$: dependent (response) variable
- $\beta_0$: population Y intercept
- $\beta_1$: population slope coefficient
- $X_i$: independent (explanatory) variable
- $\varepsilon_i$: random error
- $\mu_{Y|X} = \beta_0 + \beta_1 X_i$: population regression line

## Random Error Term

- $\varepsilon_i$ is the random error term for the $i$th observation, where the $\varepsilon_i$'s are independently normally distributed with mean 0 and variance $\sigma^2$ for $i = 1, \ldots, n$, and $n$ is the number of observations.
- It represents the effect of other factors, apart from X, which are omitted from the model but do affect the response variable to some extent.
- It may also account for errors of observation or measurement in recording the response variable.

## 4 Assumptions Made on the Random Error Term

1. The error terms are independent from one another;
2. The error terms are normally distributed;
3. The error terms all have a mean of 0; and
4. The error terms have constant variance, $\sigma^2$.

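## Population Linear Regression: Simple Linear Regression Model

[Figure: observed values of Y plotted against X around the population regression line. Each observed value is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. The line is the conditional mean $\mu_{Y|X} = \beta_0 + \beta_1 X_i$, with intercept $\beta_0$ and slope $\beta_1$, and $\varepsilon_i$ is the random error.]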

## Interpretation of the Slope and the Intercept

- $\beta_0 = E(Y \mid X = 0)$ is the average value of Y when the value of X is zero.
- $\beta_1 = \dfrac{\Delta E(Y \mid X)}{\Delta X}$ measures the change in the average value of Y as a result of a one-unit change in X.

## Steps in Doing a Simple Linear Regression Analysis

1. Obtain the equation that best fits the data;
2. Evaluate the equation to determine the strength of the relationship for estimation and prediction;
3. Determine if the assumptions on the error terms are satisfied and if the model fits the data adequately;
4. Use the equation for prediction and description.

## Sample Linear Regression

The sample regression line provides an estimate of the population regression line as well as a predicted value of Y.

$$Y_i = b_0 + b_1 X_i + e_i$$

$$\hat{Y} = b_0 + b_1 X$$

- $b_0$: sample Y intercept
- $b_1$: sample slope coefficient
- $e_i$: residual
- $\hat{Y}$: sample regression line (fitted regression line, predicted value)

## Estimation using Method of Least Squares

- The estimates for the parameters $\beta_0$ and $\beta_1$ are obtained by minimizing the sum of the squared errors:

$$\sum_{i=1}^{n} \left(Y_i - \mu_{Y|X_i}\right)^2 = \sum_{i=1}^{n} \varepsilon_i^2$$

- $b_0$ provides an estimate of $\beta_0$.
- $b_1$ provides an estimate of $\beta_1$.
- As a result of LS estimation, the values of $b_0$ and $b_1$ also minimize the sum of the squared residuals:

$$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} e_i^2$$

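## Sample Linear Regression

[Figure: observed values of Y plotted against X, comparing the sample regression line $\hat{Y}_i = b_0 + b_1 X_i$ (intercept $b_0$, slope $b_1$, residual $e_i$, with $Y_i = b_0 + b_1 X_i + e_i$) against the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (intercept $\beta_0$, slope $\beta_1$, error $\varepsilon_i$, with $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$).]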

## Interpretation of the Slope and the Intercept

- $b_0 = \hat{E}(Y \mid X = 0)$ is the estimated average value of Y when the value of X is zero.
- $b_1 = \dfrac{\Delta \hat{E}(Y \mid X)}{\Delta X}$ is the estimated change in the average value of Y as a result of a one-unit change in X.

## Example

Examine the linear relationship of the annual sales of produce stores on their size in square footage. Find the equation of the straight line that fits the data best.

| Store | Square Feet | Annual Sales (\$1000) |
|---|---|---|
| 1 | 1,726 | 3,681 |
| 2 | 1,542 | 3,395 |
| 3 | 2,816 | 6,653 |
| 4 | 5,555 | 9,543 |
| 5 | 1,292 | 3,318 |
| 6 | 2,208 | 5,563 |
| 7 | 1,313 | 3,760 |

## Example

$$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$$

From Excel printout:

| | Coefficients |
|---|---|
| Intercept | 1636.414726 |
| X Variable 1 | 1.486633657 |

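
The slides take the coefficients from an Excel printout. As a hedged illustration, the same least-squares estimates can be reproduced with the standard closed-form expressions $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{Y} - b_1\bar{X}$, which are not spelled out in the slides. The function name below is hypothetical.

```python
# Produce store data: size in square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def least_squares_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


b0, b1 = least_squares_line(square_feet, annual_sales)
print(round(b0, 3), round(b1, 3))  # the printout reports 1636.415 and 1.487
```
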
## Example

[Figure: scatter plot of Annual Sales (\$000) against Square Feet, with the fitted line $\hat{Y}_i = 1636.415 + 1.487 X_i$.]

## Example

$$\hat{Y}_i = 1636.415 + 1.487 X_i$$

The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.

The model estimates that for each increase of one square foot in the size of the store, the expected annual sales are predicted to increase by \$1,487.

## Residual Analysis

- Purposes
  - Examine linearity
  - Evaluate the assumptions to see if any is violated
- Graphical analysis of residuals
  - Plot residuals vs. $X_i$ and $\hat{Y}_i$ (and time if necessary)

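## Residual Analysis for Linearity

[Figures: plots of Y against X and of the residuals e against X for two cases, one not linear and one linear (the linear case is marked as the desired one).]
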
## Residual Analysis for Homoscedasticity

[Figures: plots of Y against X and of SR against X for two cases, heteroscedasticity and homoscedasticity (the homoscedastic case is marked as the desired one).]

## Residual Analysis: Excel Output

| Observation | Predicted Y | Residuals |
|---|---|---|
| 1 | 4202.344417 | -521.3444173 |
| 2 | 3928.803824 | -533.8038245 |
| 3 | 5822.775103 | 830.2248971 |
| 4 | 9894.664688 | -351.6646882 |
| 5 | 3557.14541 | -239.1454103 |
| 6 | 4918.90184 | 644.0981603 |
| 7 | 3588.364717 | 171.6352829 |

[Figure: residual plot of the residuals against Square Feet.]

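
The predicted values and residuals in the Excel output follow directly from the fitted line: $\hat{Y}_i = b_0 + b_1 X_i$ and $e_i = Y_i - \hat{Y}_i$. A minimal sketch, using the coefficients as printed and the store data from the example (variable names are illustrative):

```python
# Coefficients as reported in the Excel printout.
b0, b1 = 1636.414726, 1.486633657

# Produce store data: square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(annual_sales, predicted)]

for store, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(store, round(y_hat, 3), round(e, 3))

# With an intercept in the model, least-squares residuals sum to
# (approximately) zero; the small remainder comes from rounded coefficients.
print(round(sum(residuals), 3))
```

Plotting `residuals` against `square_feet` would give the residual plot referred to on the slide.
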

## Inference about the Slope: t Test

- t test for a population slope
  - Is there a linear relationship of Y on X?
- Null and alternative hypotheses
  - $H_0: \beta_1 = 0$ (no linear relationship)
  - $H_1: \beta_1 \ne 0$ (linear relationship)
- Test statistic

$$t = \frac{b_1}{S_{b_1}}, \qquad \text{where } S_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}}$$

$$\text{where } MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2}{n-2}$$

## Example: Produce Store

Data for seven stores:

| Store | Square Feet | Annual Sales (\$000) |
|---|---|---|
| 1 | 1,726 | 3,681 |
| 2 | 1,542 | 3,395 |
| 3 | 2,816 | 6,653 |
| 4 | 5,555 | 9,543 |
| 5 | 1,292 | 3,318 |
| 6 | 2,208 | 5,563 |
| 7 | 1,313 | 3,760 |

Estimated regression equation: $\hat{Y}_i = 1636.415 + 1.487 X_i$

The slope of this model is 1.487. Is square footage of the store affecting its annual sales?

## Inferences about the Slope: t-test

- $H_0: \beta_1 = 0$
- $H_1: \beta_1 \ne 0$
- $\alpha = .05$
- df = 7 - 2 = 5
- Critical values: $\pm 2.5706$ (rejection regions of .025 in each tail)

Test statistic, from Excel printout:

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 1636.4147 | 451.4953 | 3.6244 | 0.01515 |
| Footage | 1.4866 | 0.1650 | 9.0099 | 0.00028 |

In the Footage row, the first three columns are $b_1$, $S_{b_1}$, and $t$.

Decision: Reject $H_0$.

Conclusion: There is evidence that square footage affects annual sales.
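
To show where the standard error and t statistic in the printout come from, the sketch below works through the slope test for the produce store data using the formulas for $MSE$, $S_{b_1}$, and $t$ given earlier. It is an illustration only: the least-squares expressions for $b_0$ and $b_1$ are the standard closed forms, and the critical value 2.5706 is the one quoted on the slide.

```python
import math

# Produce store data: square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Sum of squares of X about its mean: sum(X^2) - (sum X)^2 / n.
sxx = sum(x * x for x in square_feet) - sum(square_feet) ** 2 / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# SSE, MSE and the standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
mse = sse / (n - 2)
s_b1 = math.sqrt(mse / sxx)

t_stat = b1 / s_b1
t_critical = 2.5706  # two-tailed, alpha = .05, df = 5 (from the slide)
reject_h0 = abs(t_stat) > t_critical
print(round(s_b1, 4), round(t_stat, 4), reject_h0)
```
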

## Pitfalls of Regression Analysis

- Lacking an awareness of the assumptions underlying least-squares regression
- Not knowing how to evaluate the assumptions
- Not knowing the alternatives to classical regression if some assumption is violated
- Using a regression model without knowledge of the subject matter

## Strategies for Avoiding the Pitfalls of Regression

- Start with a scatter plot of X on Y to observe a possible relationship.
- Perform residual analysis to check the assumptions.
  - Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality.
- If there is violation of any assumption, use alternative methods to least-squares regression or alternative least-squares models (e.g., curvilinear or multiple regression).
- If there is no evidence of assumption violation, then test for the significance of the regression coefficients.

## Problem Set

c) Find the equation of the regression model that will predict cost per day given the length of services in days.