
Chapter Five

Simple Linear Regression and Correlation Analysis


9.1 Simple Linear Regression Analysis


Regression is concerned with bringing out the nature of the relationship between two variables and using it to find the best approximate value of one variable corresponding to a known value of the other variable.
Simple linear regression deals with the method of fitting a straight line (the regression line) to a sample of data on two variables, expressed as an equation, so that if the value of one variable is given we can predict the value of the other variable.
In other words, if we have two variables under study, one may represent the cause and the other may represent the effect. The variable representing the cause is known as the independent (predictor or regressor) variable and is usually denoted by X. The variable representing the effect is known as the dependent (predicted) variable and is usually denoted by Y. If the relationship between the two variables is a straight line, it is known as simple linear regression.
When there are more than two variables and one of them is assumed to be dependent upon the others, the functional relationship between the variables is known as multiple linear regression.
Scatter diagram: a plot of all the ordered pairs (x, y) on the coordinate plane, which is used to discover whether the relationship between the two variables is indeed best explained by a straight line.
Example:
Advertising budget (X):  5   6   7   8   9   10   11
Profit (Y):              8   7   9  10  13   12   13

[Scatter diagram: profit (Y) plotted against advertising budget (X) for the seven pairs above, with Y on the vertical axis (2 to 13) and X on the horizontal axis (1 to 11).]

If we draw a line through these points, the regression line is the one that passes through, or comes closest to, all the points in the scatter diagram.
[Scatter diagram with the fitted regression line drawn through the cloud of points.]
The simple linear regression of Y on X in the population is given by:

Y = α + βX + ε

where
α = the y-intercept
β = the slope of the line, or regression coefficient
ε = the error term

The y-intercept α and the regression coefficient β are population parameters. We obtain the estimates of α and β from the sample. The estimators of α and β are denoted by a and b, respectively. The fitted regression line is thus:

Ye = a + bX
The above algebraic equation is known as the regression line. The method of finding such a relationship is known as fitting the regression line. For each observed value of the variable X, we can compute a value of Y from this equation. The computed values of Y are known as the expected values of Y and are denoted by Ye.
The observed values of Y are denoted by Y. The difference between the observed and the expected values, Y − Ye, is known as the error or residual and is denoted by e. The residual can be positive, negative or zero.

A best fitting line is one for which the sum of squares of the residuals, Σe² = Σ(Y − Ye)², is minimum. For this purpose the principle called the method of least squares is used.
According to the principle of least squares, one would select a and b such that Σ(Y − Ye)² is minimum, where Ye = a + bX.
To minimize this function, we first take the partial derivatives of Σ(Y − Ye)² with respect to a and b. Then the partial derivatives are equated to zero separately. These will result in the following normal equations:

Σy = na + bΣx
Σxy = aΣx + bΣx²
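Written out, the partial-derivative step behind these normal equations can be sketched as follows (same notation as above):

\[
\frac{\partial}{\partial a}\sum (Y - a - bX)^2 = -2\sum (Y - a - bX) = 0
\quad\Rightarrow\quad \sum Y = na + b\sum X
\]
\[
\frac{\partial}{\partial b}\sum (Y - a - bX)^2 = -2\sum X (Y - a - bX) = 0
\quad\Rightarrow\quad \sum XY = a\sum X + b\sum X^2
\]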

Solving these normal equations simultaneously we can get the values of a and b as follows:

b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

and

a = ȳ − b x̄
Regression analysis is useful in predicting the value of one variable from the given values of another
variable.
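For illustration, the two formulas above can be turned into a short Python sketch and applied to the advertising-budget data from the scatter diagram example (the function name fit_line is only illustrative):

def fit_line(x, y):
    """Fit Ye = a + b*X by the least squares formulas above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # a = y-bar minus b times x-bar
    return a, b

# advertising budget (X) and profit (Y) from the scatter diagram example
x = [5, 6, 7, 8, 9, 10, 11]
y = [8, 7, 9, 10, 13, 12, 13]
a, b = fit_line(x, y)
print(a, b)   # roughly a = 2.0 and b = 1.04, i.e. Ye = 2.0 + 1.04X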
Example: A researcher wants to find out if there is any relationship between the height of a son and that of his father. He took a random sample of 6 fathers and their sons. The heights in inches are given in the table below.
(i) Find the regression line of Y on X.
(ii) What would be the height of the son if his father's height is 70 inches?
Height of father (X):   63   65   66   67   67   68
Height of the son (Y):  66   68   65   67   69   70

Solution:
n = 6, ΣX = 396, ΣY = 405, ΣX² = 26152, ΣY² = 27355, ΣXY = 26740

(i)
b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = [6(26740) − (396)(405)] / [6(26152) − (396)²] = 0.625

a = ȳ − b x̄ = 67.5 − (0.625)(66) = 26.25

So the fitted regression line is Y = 26.25 + 0.625X.

(ii)
If X = 70, then Y = 26.25 + 0.625(70) = 70; thus the estimated height of the son is 70 inches.
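The arithmetic in this example can be double-checked with a few lines of Python working directly from the totals above:

n = 6
sum_x, sum_y = 396, 405
sum_x2, sum_xy = 26152, 26740

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # 0.625
a = sum_y / n - b * sum_x / n                                  # 26.25
print(a, b)         # 26.25 0.625
print(a + b * 70)   # 70.0, the predicted height when the father is 70 inches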

9.2 Simple Linear Correlation Analysis


The measure of the degree of linear relationship between two continuous variables is known as the correlation coefficient. The population correlation coefficient is represented by ρ and its estimator by r. The correlation coefficient r is also called Pearson's correlation coefficient since it was developed by Karl Pearson. r is given as the ratio of the covariance of the variables x and y to the product of the standard deviations of x and y. Symbolically,

r = Cov(x, y) / [sd(x) · sd(y)]

  = [Σ(x − x̄)(y − ȳ) / (n − 1)] / [√(Σ(x − x̄)² / (n − 1)) · √(Σ(y − ȳ)² / (n − 1))]

  = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

  = [Σxy − (Σx)(Σy)/n] / √{[Σx² − (Σx)²/n] · [Σy² − (Σy)²/n]}
The numerator is termed the sum of products of x and y, SPxy. In the denominator, the first term is called the sum of squares of x, SSx, and the second term is called the sum of squares of y, SSy. Thus,

r = SPxy / √(SSx · SSy)

The correlation coefficient is always between −1 and +1, i.e., −1 ≤ r ≤ +1.


r = −1 implies a perfect negative linear correlation between the variables under consideration.
r = +1 implies a perfect positive linear correlation between the variables under consideration.
r = 0 implies there is no linear relationship between the two variables, but there could be a non-linear relationship between them. In other words, when two variables are independent, r = 0; but when r = 0, it is not necessarily true that the variables are independent.
[Scatter diagrams: perfect negative correlation (r = −1), perfect positive correlation (r = +1), and no correlation (r = 0).]
9.3 Coefficient of Determination (R²)


The square of the correlation coefficient, r², is called the coefficient of determination. It measures the proportion of the variation in the dependent variable Y that is explained by variation in the independent variable X.
For example, if r = 0.8, then r² = 0.64. This means that, on the basis of the sample, approximately 64% of the variation in the dependent variable Y is explained by the variation of the independent variable X. The remaining 1 − r², i.e. 36% of the variation in Y, is unexplained by variation in X. In other words, variables (factors) other than X could have caused the remaining 36% of the variation in Y.
Example: The research director of the Dubbary Saving and Loan Bank collected 24 observations of mortgage interest rates X and the number of house sales Y at each interest rate. The director computed that

n = 24, Σx = 276, Σy = 768, Σx² = 3300, Σy² = 2500, Σxy = 8690

Then compute:
(i) The coefficient of correlation.
(ii) The coefficient of determination.

Solution:

(i)
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²] · [nΣy² − (Σy)²]}
  = [24(8690) − (276)(768)] / √{[24(3300) − (276)²] · [24(2500) − (768)²]}
  = 0.61

(ii)
Coefficient of determination: R² = r² = (0.61)² ≈ 0.37. This shows that about 37% of the variation in the number of house sales is explained by the variation in the interest rate.
