Correlation and regression are concerned with the investigation of relationships between two or more variables.
Does a relationship exist between those variables?
If so, how strong is that relationship?
What form does that relationship take?
Can we make use of that relationship for predictive purposes, i.e. forecasting?
Correlation is used to find the strength of the relationship. Regression describes the relationship itself in the form of an equation which best fits the data.
Using linear regression, we find the equation of the line of best fit through the data.
The 'goodness of fit' statistic can be calculated to see how useful the regression equation is likely to be. Once defined by an equation, the relationship can be used for predictive purposes.
Example: The data represent a sample of advertising expenditures and sales for ten randomly selected months. See slide 12 for complete data.
Month   Advertising expenditure (0,000's) x   Sales (0,000's) y
1       1.2                                   101
2       0.8                                   92
3       1.0                                   110
etc.
[Figure: scatter diagram of sales against advertising (0,000's)]
The graph suggests a linear relationship between sales and advertising expenditure. In general, the larger the amount spent on advertising, the higher the sales.
If there is a relationship, we need to be able to measure its strength, i.e. calculate the value of the correlation coefficient.
Pearson's Product Moment Correlation Coefficient (r) is a measure of how close to linear the relationship between x and y is. It can be produced directly from a calculator in LR (linear regression) mode.
For the sales and advertising data the correlation coefficient is r = 0.875.
The value of r is always between +1 and -1.
[Figure: three example scatter diagrams, showing r = -0.7, r = 0 (no correlation) and r = +0.8]
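[Table: calculations for months 1 to 10, with totals]
Step 2
Therefore:
$S_{xx} = \sum x^2 - \frac{\sum x \sum x}{n} = 0.444$
$S_{yy} = \sum y^2 - \frac{\sum y \sum y}{n} = 93569 - \frac{959 \times 959}{10} = 1600.9$
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 924.8 - \frac{9.4 \times 959}{10} = 23.34$
Step 3
Therefore:
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{23.34}{\sqrt{0.444 \times 1600.9}} = 0.875$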
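The arithmetic in Steps 2 and 3 can be checked with a short Python sketch. It uses only the totals quoted in the example; the value of Sxx is taken as given because the sum of x squared is not quoted, and the helper name corrected_sum is purely illustrative.

```python
import math

# Totals quoted in the worked example (n = 10 months).
n = 10
sum_x, sum_y = 9.4, 959.0
sum_y2, sum_xy = 93569.0, 924.8
s_xx = 0.444  # taken as given; the sum of x squared is not quoted in the text


def corrected_sum(sum_ab, sum_a, sum_b, n):
    """Return sum(ab) - sum(a) * sum(b) / n (illustrative helper)."""
    return sum_ab - sum_a * sum_b / n


s_yy = corrected_sum(sum_y2, sum_y, sum_y, n)
s_xy = corrected_sum(sum_xy, sum_x, sum_y, n)
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"Syy = {s_yy:.1f}, Sxy = {s_xy:.2f}, r = {r:.3f}")
# prints: Syy = 1600.9, Sxy = 23.34, r = 0.875
```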
Null hypothesis (H0): A linear relationship does not exist between sales and advertising.
Alternative hypothesis (H1): A linear relationship does exist between sales and advertising.
If we calculate a test statistic and critical value, we find that test statistic > critical value, so we reject H0.
We conclude that a linear relationship exists between sales and the amount spent on advertising.
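The slides do not say which test statistic is used. One common choice, assumed here, is the t statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The critical value 2.306 (two-tailed, 5% level, 8 degrees of freedom) is read from standard t tables and is also an assumption, as is the choice of significance level.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: no linear relationship, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.875, 10
# Assumed: two-tailed 5% critical value of Student's t with 8 degrees of freedom,
# read from standard tables (not stated in the slides).
critical_value = 2.306

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > critical_value
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the test statistic is a little over 5, well above the critical value, which agrees with the decision to reject H0.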
The Goodness of Fit Statistic ($R^2$)
This also measures the closeness of the relationship between x and y.
$R^2 = 100r^2$
$R^2$ tells us what percentage of the total variation in y (here sales) is explained by the variation in x (here advertising expenditure).
Interpretation:
If r = +1 or -1, then $R^2 = 100\%$, so 100% of the variation in y is explained by the variation in x.
If r = 0, then $R^2 = 0\%$, so none of the variation in y is explained by the variation in x.
For the data above the goodness of fit statistic is $R^2 = 100r^2 = 100 \times 0.875^2 = 76.6\%$.
76.6% of the variation in sales is explained by the variation in the amount spent on advertising. The remaining 23.4% of the variation is explained by other factors, e.g. price, competitors' prices, etc.
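As a quick check of the percentages above, the goodness of fit can be computed from r in a couple of lines. The function name goodness_of_fit is illustrative only.

```python
def goodness_of_fit(r):
    """Return the goodness of fit as a percentage: 100 * r squared."""
    return 100 * r ** 2


r = 0.875
explained = goodness_of_fit(r)
unexplained = 100 - explained
print(f"explained: {explained:.1f}%, unexplained: {unexplained:.1f}%")
# prints: explained: 76.6%, unexplained: 23.4%
```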
Regression equation
Since we know, for the sample data, that there is a significant relationship between the two variables, the next obvious step is to find its equation. We can then add the regression line to the scatter diagram and use it to predict future sales, given advertising expenditure for a particular month. The regression equation can be produced directly from a calculator in LR mode.
The regression line has the equation: y = a + bx
x is the independent variable
y is the dependent variable
a is the intercept on the y-axis
b is the gradient or slope of the line.
For the sales and advertising data, the values of a and b are 46.5 and 52.6, so the regression equation is:
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 × advertising
(a and b can be found using LR mode on your calculator or by calculation.)
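$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$
where $\bar{x}$, $\bar{y}$ are the means of the x and y data and the S's are defined as previously.
Calculations for the regression equation
In the regression equation y = a + bx:
$b = \frac{S_{xy}}{S_{xx}} = \frac{23.34}{0.444} = 52.6$
$a = \bar{y} - b\bar{x} = 95.9 - 52.6 \times 0.94 = 46.5$
(as $\bar{y} = \frac{\sum y}{n} = \frac{959}{10} = 95.9$ and $\bar{x} = \frac{\sum x}{n} = \frac{9.4}{10} = 0.94$)
Therefore the regression equation is y = 46.5 + 52.6x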
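The same calculation can be sketched in Python from the quoted totals. Note that this sketch carries the unrounded slope into the intercept calculation, whereas the hand calculation uses the rounded 52.6; both give 46.5 to one decimal place.

```python
# Values quoted in the worked example.
n = 10
sum_x, sum_y = 9.4, 959.0
s_xy, s_xx = 23.34, 0.444

x_bar = sum_x / n        # mean of x
y_bar = sum_y / n        # mean of y
b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"b = {b:.1f}, a = {a:.1f}")
# prints: b = 52.6, a = 46.5
```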
Plotting the regression equation on the scatter diagram
The line y = a + bx can be plotted on the scatter diagram by plotting three points: the centroid $(\bar{x}, \bar{y})$ and any other two points which satisfy the regression equation.
From the data, $(\bar{x}, \bar{y})$ = (0.94, 95.9). Plot (0.94, 95.9).
When x = 0.6, y = 46.5 + (52.6 × 0.6) = 78.06. Plot (0.6, 78.1).
When x = 1.2, y = 46.5 + (52.6 × 1.2) = 109.6. Plot (1.2, 109.6).
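The three plotting points can be generated from the regression equation, which also confirms that the centroid lies on the line (up to rounding of a and b). This is a minimal sketch using the rounded coefficients 46.5 and 52.6; the function name fitted_sales is illustrative.

```python
A, B = 46.5, 52.6  # rounded intercept and slope from the worked example


def fitted_sales(x):
    """Fitted y on the regression line y = a + bx."""
    return A + B * x


# x = 0.94 is the mean of x, so its fitted value should match the mean of y (95.9).
for x in (0.6, 0.94, 1.2):
    print(f"({x}, {fitted_sales(x):.1f})")
```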
[Figure: scatter diagram of sales against advertising, with the regression line drawn through the plotted points]
Note: the regression equation y = a + bx can only be used to calculate an estimate for y given the value of x.
The linear relationship y = a + bx can only be assumed to exist between y and x for the range of values within the sample.
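One way to respect both restrictions in practice is to wrap the equation in a function that refuses to extrapolate. In the sketch below the bounds x_min = 0.6 and x_max = 1.3 are hypothetical: they are read off the axis of the scatter diagram, not taken from the full sample, and should be replaced by the actual minimum and maximum of x.

```python
def predict_sales(x, a=46.5, b=52.6, x_min=0.6, x_max=1.3):
    """Estimate y from x, refusing to extrapolate beyond [x_min, x_max].

    x_min and x_max are hypothetical bounds read off the scatter-diagram axis,
    not the true sample minimum and maximum.
    """
    if not x_min <= x <= x_max:
        raise ValueError(f"x = {x} is outside the sampled range [{x_min}, {x_max}]")
    return a + b * x


print(round(predict_sales(1.0), 1))  # estimate inside the range

try:
    predict_sales(5)  # well outside the range, so this raises
except ValueError as err:
    print(err)
```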
Interpreting the coefficients in the regression equation
First, the a value.
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 × advertising
The intercept (a) is the estimate of y when x = 0, but care is needed if using this. Why? Because x = 0 lies outside the range of values within the sample.
When x = 0, y = 46.5, i.e. when nothing is spent on advertising, sales would be expected on average to be 46.5 units = 46.5 × 10,000 = 465,000.
Next, the b value.
y = 46.5 + 52.6x
If x = 0, y = 46.5, but care is needed here!
If x = 0.6, y = 46.5 + (52.6)(0.6) = 78.06
If x = 0.8, y = 46.5 + (52.6)(0.8) = 88.58
If x = 1, y = 46.5 + 52.6 = 99.1
If x = 1.2, y = 46.5 + (52.6)(1.2) = 109.62
If x = 2, y = 46.5 + 52.6 × 2 = 151.7, but care is needed here also!
etc.
So if advertising expenditure is increased by 1 unit, sales will be increased by 52.6 units on average.
For each additional 10,000 spent on advertising, sales will increase by 52.6 × 10,000 = 526,000 on average. But we cannot estimate sales outside the range of the sample, e.g. we should not try to estimate sales for x = 5 using this method.