
Introduction

In studying the relationship between variables, we need numerical measures that express how the variables are related. The usual questions are: is the relationship strong or weak, and is it direct or inverse? Beyond measuring the relationship, we also develop an equation that expresses it. We begin this chapter by examining the meaning and purpose of correlation analysis. We then develop a mathematical equation that allows us to estimate the value of one variable based on the value of another. This is called regression analysis, and it involves four important steps:

1. Determine the equation of the line that best fits the data.
2. Use the equation to estimate the value of one variable based on another variable.
3. Measure the error of our estimate.
4. Establish confidence and prediction intervals for our estimates.

Correlation analysis

Correlation analysis is the study of the relationship between variables. To illustrate, suppose the sales manager of Home Corp, which has a large sales force throughout Namibia, wants to determine the relationship between the number of sales calls made in a month and the number of actual sales made. The manager selects a random sample of 10 representatives and records the number of sales calls each representative made last month and the number of actual sales. The sample information is shown below.

Sales Representative    Number of Sales Calls    Actual Sales
Tom Keller              20                       30
Jeff Hall               40                       60
Brian Virost            20                       40
Greg Fish               30                       60
Susan Welch             10                       30
Carlos Ramirez          10                       40
Rich Niles              20                       40
Mike Kiel               20                       50
Mark Reynolds           20                       30
Soni Jones              30                       70

By reviewing the data we observe that there seems to be some relationship between the number of sales calls and the number of units sold: the representatives who made the most calls tended to sell the most units. The relationship is not perfect or exact, however. Soni Jones made fewer calls than Jeff Hall but sold more units. To express the degree of relationship, the first step is usually to construct a scatter plot. In this case actual sales is the dependent variable and the number of sales calls is the independent variable. The dependent variable is scaled on the Y-axis and the independent variable on the X-axis. The figure below shows a scatter plot of the number of units sold against the number of sales calls made.

[Figure: scatter plot of Actual Sales (Y-axis, 0 to 80) against Sales Calls (X-axis, 0 to 50)]

The scatter plot shows graphically that the sales representatives who make more calls tend to achieve higher sales volumes. It is important to note, however, that the points do not all fall on a straight line. The next section measures the strength and direction of this relationship by determining the coefficient of correlation.

Coefficient of correlation

The coefficient of correlation is a measure of the strength of the linear relationship between two sets of interval-scaled or ratio-scaled variables. It is denoted r and is also referred to as Pearson's r or the product-moment correlation coefficient. It can assume any value from -1 to +1 inclusive. A correlation coefficient of -1 or +1 indicates perfect correlation, meaning the variables are related in a perfectly linear sense. If there is no linear relationship between the variables at all, the correlation coefficient is zero. When the correlation coefficient is close to zero the relationship is said to be weak. A correlation coefficient close to -1 or +1 indicates a strong linear relationship; the sign shows whether the relationship is inverse (negative) or direct (positive).

Calculating the correlation coefficient

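Name              Calls, X    Actual Sales, Y    X - X̄    Y - Ȳ    (X - X̄)(Y - Ȳ)
Tom Keller        20          30                 -2        -15       30
Jeff Hall         40          60                 18        15        270
Brian Virost      20          40                 -2        -5        10
Greg Fish         30          60                 8         15        120
Susan Welch       10          30                 -12       -15       180
Carlos Ramirez    10          40                 -12       -5        60
Rich Niles        20          40                 -2        -5        10
Mike Kiel         20          50                 -2        5         -10
Mark Reynolds     20          30                 -2        -15       30
Soni Jones        30          70                 8         25        200

Here $\bar{X} = 22$ and $\bar{Y} = 45$, the sample standard deviations are $s_x = 9.189$ and $s_y = 14.337$, and the products in the last column sum to 900.

$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1) s_x s_y} = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759$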
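The calculation of r for the sales data can be checked with a short Python sketch. The function name `pearson_r` is a hypothetical helper written for this illustration; it uses the algebraically equivalent form in which the sum of cross-products is divided by the square root of the product of the two sums of squares, which gives the same value as dividing by (n - 1) times the two standard deviations.

```python
from math import sqrt

# Sample data from the Home Corp example
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y: actual sales


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(calls, sales)
print(round(r, 3))       # 0.759
print(round(r ** 2, 3))  # 0.576
```

The second printed value is simply r squared.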

Coefficient of determination

In the previous example we may be tempted to assume that an increase or decrease in one variable causes a change in the other. What we can conclude when two variables are strongly correlated is only that there is a relationship or association between them, not that a change in one causes a change in the other. The coefficient of determination is computed by squaring the correlation coefficient. It is the proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X; it is not a measure of causal effect. The correlation coefficient in the previous example was 0.759, so the coefficient of determination is $r^2 = (0.759)^2 = 0.576$. This means that 57.6 percent of the variation in actual sales is explained by the variation in the number of sales calls made in that month.

Regression Analysis

Least Squares Principle

A straight line drawn with a ruler through the scatter diagram from the previous section would probably fit the data reasonably well, but the result would depend on who drew it. To estimate the value of the dependent variable Y from the independent variable X in an objective way, a technique called regression analysis is used. The equation used to estimate Y on the basis of X is called the regression equation. The least squares method chooses the line that minimises the sum of the squared vertical deviations between the actual Y values and the predicted Y values. For the least squares line, the sum of the error terms $(Y - \hat{Y})$ equals zero.

The general form of the linear regression equation is

$\hat{Y} = a + bX$

where
$\hat{Y}$ is the estimated value of the Y variable for a selected value of X,
a is the Y-intercept, the estimated value of Y when X is zero,
b is the slope of the line, the average change in $\hat{Y}$ for each one-unit change in X,
X is any value of the independent variable that is selected.

The formulas for b and a are

Slope of the regression line: $b = r \frac{s_y}{s_x}$
Y-intercept: $a = \bar{Y} - b\bar{X}$
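As an illustration, the slope and intercept for the sales data can be computed with the Python sketch below. The helper name `least_squares_line` is hypothetical. It computes the slope as the sum of cross-products divided by the sum of squared X deviations, which is algebraically the same as r multiplied by the ratio of the standard deviation of Y to that of X.

```python
# Sample data from the Home Corp example
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y: actual sales


def least_squares_line(x, y):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # Y-intercept
    return a, b


a, b = least_squares_line(calls, sales)
print(round(a, 4), round(b, 4))   # 18.9474 1.1842

# Estimated sales for a representative who makes 20 calls
estimate = a + b * 20
print(round(estimate, 4))         # 42.6316

# The residuals of a least squares line sum to zero (up to rounding error)
residuals = [y - (a + b * x) for x, y in zip(calls, sales)]
print(abs(sum(residuals)) < 1e-9)  # True
```

On this reading, each additional sales call is associated with about 1.18 additional units sold on average.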

Multiple Regression and Correlation Analysis

In multiple linear regression, additional independent variables, denoted $X_1, X_2, X_3, \ldots, X_k$, are used to help predict the dependent variable Y. Multiple linear regression analysis can be used either as a descriptive or as an inferential technique. The general multiple regression equation is

$\hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k$

where
k is the number of independent variables,
a is the intercept, the value of $\hat{Y}$ when all independent variables are equal to zero,
$b_j$ is the amount by which $\hat{Y}$ changes when $X_j$ increases by one unit, with the values of all the other independent variables held constant.

With two independent variables the equation is

$\hat{Y} = a + b_1 X_1 + b_2 X_2$

Suppose the equation produced by statistical software, with the coefficients found by the least squares method, is $\hat{Y} = 6.3 + 0.2X_1 - 0.001X_2$. Here $b_1 = 0.2$ indicates that an increase of 1 unit in $X_1$ increases $\hat{Y}$ by 0.2 units when $X_2$ is held constant.

Multiple Standard Error of Estimate

This is a measure of how well the equation fits the data, and it is comparable to the standard deviation. The standard deviation uses squared deviations from the mean, $(Y - \bar{Y})^2$, whereas the standard error of estimate uses squared deviations from the regression line, $(Y - \hat{Y})^2$.

$s_{Y.123 \ldots k} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - (k + 1)}}$

ANOVA Table

Source               df           SS          MS                       F
Regression           k            SSR         MSR = SSR/k              MSR/MSE
Residual or Error    n-(k+1)      SSE         MSE = SSE/(n-(k+1))
Total                n-1          SS total

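Coefficient of Multiple Determination

The coefficient of multiple determination is symbolised by a capital R squared, $R^2$, and ranges from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable, while a value near 1 indicates a strong association. It is the proportion of the total variation in Y explained by the regression:

$R^2 = \frac{SSR}{SS\ total}$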

Adjusted Coefficient of Determination

Increasing the number of independent variables in a multiple regression equation makes the coefficient of determination larger. Each new independent variable makes the fitted values at least as close to the sample data, which makes SSE smaller and SSR larger. $R^2$ can therefore increase simply because the total number of independent variables has grown, not because the added variable is a good predictor of the dependent variable. The adjusted coefficient of determination compensates for this by dividing each sum of squares by its degrees of freedom:

$R^2_{adj} = 1 - \frac{SSE / (n - (k + 1))}{SS\ total / (n - 1)}$
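To show how the ANOVA quantities, the standard error of estimate, R squared and adjusted R squared fit together, the Python sketch below computes them for the sales-calls example from earlier in the chapter, so k = 1 here. The function name `anova_summary` and its dictionary keys are hypothetical, chosen only for this illustration. The function assumes the fitted values come from a least squares fit that includes an intercept, since only then does SSR equal SS total minus SSE.

```python
from math import sqrt

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y: actual sales


def anova_summary(y, y_hat, k):
    """ANOVA quantities for a least squares fit with k independent variables."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ssr = ss_total - sse            # valid for least squares with an intercept
    df_error = n - (k + 1)
    msr = ssr / k
    mse = sse / df_error
    return {
        "SSR": ssr,
        "SSE": sse,
        "SS_total": ss_total,
        "F": msr / mse,
        "std_error": sqrt(mse),     # standard error of estimate
        "r_squared": ssr / ss_total,
        "adj_r_squared": 1 - mse / (ss_total / (n - 1)),
    }


# Least squares line for the sales data
n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
     / sum((x - mean_x) ** 2 for x in calls))
a = mean_y - b * mean_x
fitted = [a + b * x for x in calls]

result = anova_summary(sales, fitted, k=1)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With a single independent variable, R squared equals the square of the correlation coefficient, and adjusted R squared comes out somewhat lower because of the degrees-of-freedom correction.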

Inferences in Multiple Linear Regression

Up to this point multiple linear regression has been viewed as a way to describe the relationship between a dependent variable and several independent variables. The least squares method also allows us to draw inferences, or generalisations, about the relationship for an entire population. In the multiple linear regression setting, the assumption is that there is an unknown population regression equation that relates the dependent variable to the k independent variables. This is called a model of the relationship, and it is written

$\hat{Y} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$

Global Test

We can test the ability of the independent variables $X_1, X_2, \ldots, X_k$ to explain the behaviour of the dependent variable Y.

Example

Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: if we purchase this home, how much can we expect to pay to heat it during winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: 1) the mean daily outside temperature, 2) the number of inches of insulation in the attic, and 3) the age in years of the furnace. To investigate, Salsberry's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well as the January outside temperature in the region, the number of inches of insulation in the attic, and the age of the furnace. The sample information is reported below.

Home    Heating Cost ($)    Mean Outside Temperature    Attic Insulation (inches)    Age of Furnace (years)
1       250                 35                          3                            6
2       360                 29                          4                            10
3       165                 36                          7                            3
4       43                  60                          6                            9
5       92                  65                          5                            6
6       200                 30                          5                            5
7       355                 10                          6                            7
8       290                 7                           10                           10
9       230                 21                          9                            11
10      120                 55                          2                            5
11      73                  54                          12                           4
12      205                 48                          5                            1
13      400                 20                          5                            15
14      320                 39                          4                            7
15      72                  60                          8                            6
16      272                 20                          5                            8
17      94                  58                          7                            3
18      190                 40                          8                            11
19      235                 27                          9                            8
20      139                 30                          7                            5

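Using Excel, the regression equation was found to be

$\hat{Y} = 427.194 - 4.583X_1 - 14.831X_2 + 6.101X_3$

where $X_1$ is the mean outside temperature, $X_2$ is the attic insulation in inches and $X_3$ is the age of the furnace in years.

We can test whether the independent variables effectively estimate home heating costs. Essentially, we are investigating the possibility that all the independent variables have zero regression coefficients. The null hypothesis is

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$

The alternate hypothesis is

$H_1$: Not all the $\beta_i$ are 0.

If the null hypothesis is true, the independent variables are not useful in predicting the dependent variable. The F distribution at the 0.05 level of significance is used to test whether all the regression coefficients are zero. With k = 3 degrees of freedom in the numerator and n - (k + 1) = 16 in the denominator, the critical value of F is 3.24.

$F = \frac{SSR / k}{SSE / (n - (k + 1))} = 21.9$

The computed F value is in the rejection region, so we reject $H_0$ and conclude that at least one of the regression coefficients is not zero. Taken together, the independent variables have some ability to explain the heating cost.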
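The global test can be sketched in Python using only the standard library. The code below fits the regression to the 20 homes in the table by solving the normal equations, then forms the F statistic. The helper name `solve` and the variable names are hypothetical, and the critical value 3.24 is read from an F table for 3 and 16 degrees of freedom at the 0.05 level rather than computed. If the table above has been transcribed accurately, the printed coefficients and F value should land close to the Excel figures quoted in the text.

```python
# Heating cost data for the 20 sampled homes
cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temp = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
        54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
insulation = [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
              12, 5, 5, 4, 8, 5, 7, 8, 9, 7]
age = [6, 10, 3, 9, 6, 5, 7, 10, 11, 5,
       4, 1, 15, 7, 6, 8, 3, 11, 8, 5]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Design matrix rows: a leading 1 for the intercept, then X1, X2, X3
rows = [[1.0, t, i, a] for t, i, a in zip(temp, insulation, age)]
n, k = len(cost), 3
p = k + 1

# Normal equations (X'X) coef = X'y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, cost)) for i in range(p)]
coef = solve(xtx, xty)

fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
mean_y = sum(cost) / n
sse = sum((y - f) ** 2 for y, f in zip(cost, fitted))
ssr = sum((f - mean_y) ** 2 for f in fitted)
f_stat = (ssr / k) / (sse / (n - (k + 1)))

F_CRIT = 3.24  # table value for F(3, 16) at the 0.05 level
print("coefficients:", [round(c, 3) for c in coef])
print("F =", round(f_stat, 2))
print("reject H0" if f_stat > F_CRIT else "do not reject H0")
```

Solving the normal equations directly is adequate for a small, well-behaved data set like this one; statistical software generally uses more numerically stable decompositions.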

Evaluating Individual Regression Coefficients

So far we have shown that not all the regression coefficients are equal to zero. The next step is to test the independent variables individually to determine which regression coefficients may be zero and which are not. We conduct three separate tests:

For temperature: $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$
For insulation: $H_0: \beta_2 = 0$, $H_1: \beta_2 \neq 0$
For furnace age: $H_0: \beta_3 = 0$, $H_1: \beta_3 \neq 0$

We will test the hypotheses at the 0.05 level using a two-tailed test and the Student t distribution with n - (k + 1) degrees of freedom, which is 20 - (3 + 1) = 16 degrees of freedom.

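$t = \frac{b_i - 0}{s_{b_i}}$

where $b_i$ is the regression coefficient and $s_{b_i}$ is the standard deviation of the sampling distribution of the regression coefficient.

For temperature:

$t = \frac{b_1 - 0}{s_{b_1}} = -5.936$

The rejection region is $t < -2.12$ or $t > 2.12$. Since -5.936 is less than -2.12, we reject $H_0$ and conclude that $\beta_1 \neq 0$.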
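The individual t tests can be sketched in Python as follows. The standard error of each coefficient is taken as the square root of MSE multiplied by the corresponding diagonal element of the inverse of X'X, which is the usual least squares result. The helper name `invert` and the variable names are hypothetical, and the critical value 2.120 is the table value quoted above for 16 degrees of freedom. This is a sketch under those assumptions, not a reproduction of the Excel output.

```python
from math import sqrt

# Heating cost data for the 20 sampled homes
cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temp = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
        54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
insulation = [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
              12, 5, 5, 4, 8, 5, 7, 8, 9, 7]
age = [6, 10, 3, 9, 6, 5, 7, 10, 11, 5,
       4, 1, 15, 7, 6, 8, 3, 11, 8, 5]


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]


rows = [[1.0, t, i, a] for t, i, a in zip(temp, insulation, age)]
n, k = len(cost), 3
p = k + 1

xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, cost)) for i in range(p)]
xtx_inv = invert(xtx)

# Least squares coefficients: (X'X)^-1 X'y
coef = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
mse = sum((y - f) ** 2 for y, f in zip(cost, fitted)) / (n - p)

# Standard error and t statistic for each coefficient (H0: beta_i = 0)
std_err = [sqrt(mse * xtx_inv[i][i]) for i in range(p)]
t_stats = [c / s for c, s in zip(coef, std_err)]

T_CRIT = 2.120  # two-tailed, 0.05 level, 16 degrees of freedom
names = ["temperature", "insulation", "furnace age"]
for name, b, t in zip(names, coef[1:], t_stats[1:]):
    decision = "reject H0" if abs(t) > T_CRIT else "do not reject H0"
    print(f"{name}: b = {b:.3f}, t = {t:.3f}, {decision}")
```

Each variable is tested separately, so a variable can fail its individual test even when the global F test rejects the hypothesis that all coefficients are zero.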