
Ben Adams Stat 462 August 9, 2013 Final Project: Boston Housing

Introduction: The U.S. Census Service compiled a data set concerning housing in the area of Boston, Massachusetts, which was first analyzed in 1978. After being slightly modified, the final data set includes 494 observations. The variable of interest in this project is the square root of the median value of owner-occupied homes in $1000s (Sqrt_MEDV). To analyze this variable we are given 13 possible predictors:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. Log_LSTAT: natural log of % lower status of the population

The objective is to find the most accurate and precise linear regression model for predicting housing prices from the given data, and to analyze that model properly.

Methodology: The very first step I took was to fit a regression model with all 13 predictors to get a feel for the data.

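The first things I noticed were that (a) the p-value in the ANOVA table is approximately 0, which means at least one of the predictors is useful, and (b) the p-values for ZN, INDUS, and AGE were all greater than 0.05. This led me to conduct an F-test:

H0: β2 = β3 = β7 = 0
HA: at least one of β2, β3, β7 is not 0

Sum of the sequential SS: 37.386 + 34.55 + 3.088 ≈ 75.025
75.025 / 3 = 25.008
F = 25.008 / 0.114 (MSE) = 219.37

Taken at face value, an F this large (p-value approximately 0) would reject the null hypothesis, which conflicts with the individual p-values above. The most likely reason is that sequential sums of squares measure the joint contribution of ZN, INDUS, and AGE only when those three predictors are entered last in the model, so this statistic should not be used on its own to justify deleting them. The individual p-values do suggest that these variables are not significantly related to Sqrt_MEDV once the other predictors are in the model. To double-check that they could be removed, I ran a best subsets regression in Minitab, and this was the output: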

Vars   R-Sq   R-Sq(adj)   Mallows Cp   S
 1     72.3   72.2          492.1      0.47776
 1     54.2   54.1         1134.5      0.61445
 2     76.7   76.6          336.4      0.43817
 2     76.3   76.2          351.1      0.44207
 3     79.9   79.8          225.1      0.40736
 3     79.8   79.7          230.4      0.40886
 4     82.4   82.2          140.2      0.38204
 4     82.2   82.0          147.3      0.38422
 5     83.6   83.5           98.0      0.36865
 5     83.1   83.0          115.7      0.37422
 6     84.7   84.5           61.0      0.35642
 6     84.1   83.9           82.9      0.36354
 7     85.3   85.1           41.8      0.34971
 7     85.1   84.8           51.6      0.35301
 8     85.8   85.6           27.7      0.34460
 8     85.6   85.4           34.7      0.34698
 9     86.2   86.0           14.2      0.33962
 9     85.9   85.7           24.7      0.34324
10     86.4   86.1            9.7      0.33771
10     86.3   86.0           15.1      0.33960
11     86.4   86.1           10.6      0.33767
11     86.4   86.1           11.5      0.33797
12     86.4   86.1           12.2      0.33789
12     86.4   86.1           12.4      0.33793
13     86.5   86.1           14.0      0.33815

[Predictor-inclusion columns (X marks) omitted]

I went through every pair of same-size models and highlighted the one with the higher R-sq and lower Cp value. None of the models with 10 or fewer variables included ZN, INDUS, or AGE, which reaffirmed that I could take them out of my model. At this point I have 10 predictors remaining.
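The figures in the best subsets output can be cross-checked by hand. The following is a minimal Python sketch, assuming that S in the Minitab output is the residual standard error (so SSE = S^2 (n - p), with p counting the intercept) and that the best 10-variable row (S = 0.33771) is the model without ZN, INDUS, and AGE. It recovers the SSE of the full and reduced models, recomputes Mallows Cp for the reduced model, and forms the extra-sum-of-squares F statistic for dropping the three predictors. The function and variable names are illustrative and not part of the original analysis.

```python
# Cross-check of the best subsets table using only values reported in it.
# Assumption: S is the residual standard error, so SSE = S^2 * (n - p),
# where p counts the intercept plus the predictors in the model.

N = 494  # observations in the data set


def sse_from_s(s, n_predictors, n=N):
    """Recover the error sum of squares from the reported S."""
    p = n_predictors + 1  # coefficients including the intercept
    return s ** 2 * (n - p)


sse_full = sse_from_s(0.33815, 13)     # all 13 predictors
sse_reduced = sse_from_s(0.33771, 10)  # best 10-predictor model
mse_full = 0.33815 ** 2                # MSE of the 13-predictor model

# Mallows Cp for the reduced model: SSE_p / MSE_full - (n - 2p)
p_reduced = 11
cp_reduced = sse_reduced / mse_full - (N - 2 * p_reduced)

# Extra-sum-of-squares F for H0: the three dropped coefficients are all 0
f_stat = ((sse_reduced - sse_full) / 3) / mse_full

print(f"SSE full = {sse_full:.3f}, SSE reduced = {sse_reduced:.3f}")
print(f"Cp (10 predictors) = {cp_reduced:.1f}")
print(f"Partial F (3 and 480 df) = {f_stat:.2f}")
```

Under these assumptions the reduced-model SSE comes out close to the 55.085 used in the AIC and BIC calculations, Cp comes out close to the tabulated 9.7, and the partial F is below 1, which agrees with the decision to drop ZN, INDUS, and AGE.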

I then fit a regression model with the 10 remaining predictors and got the following output:

This time all of the p-values are under 0.05. With nothing seeming out of place, I went back to the best subsets method and got this output:

Vars   R-Sq   R-Sq(adj)   Mallows Cp   S
 1     72.3   72.2          494.7      0.47776
 1     54.2   54.1         1138.7      0.61445
 2     76.7   76.6          338.6      0.43817
 2     76.3   76.2          353.3      0.44207
 3     79.9   79.8          227.0      0.40736
 3     79.8   79.7          232.2      0.40886
 4     82.4   82.2          141.8      0.38204
 4     82.2   82.0          149.0      0.38422
 5     83.6   83.5           99.5      0.36865
 5     83.1   83.0          117.2      0.37422
 6     84.7   84.5           62.5      0.35642
 6     84.1   83.9           84.3      0.36354
 7     85.3   85.1           43.2      0.34971
 7     85.1   84.8           53.0      0.35301
 8     85.8   85.6           29.0      0.34460
 8     85.6   85.4           36.0      0.34698
 9     86.2   86.0           15.5      0.33962
 9     85.9   85.7           26.0      0.34324
10     86.4   86.1           11.0      0.33771

[Predictor-inclusion columns (X marks) omitted]

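Out of all the possibilities, the lowest Cp and highest R-sq belonged to the model with all 10 remaining variables, with the next best being the 9-variable model. I then calculated the AIC and BIC of both models, where n = 494 and p is the number of regression coefficients including the intercept (10 for the 9-variable model, 11 for the 10-variable model):

AIC = n ln(SSE/n) + 2p = n ln(SSE) - n ln(n) + 2p
9-variable:  494 ln(55.826) - 494 ln(494) + 2(10) = -1057.07
10-variable: 494 ln(55.085) - 494 ln(494) + 2(11) = -1061.67

BIC = n ln(SSE/n) + p ln(n) = n ln(SSE) - n ln(n) + p ln(n)
9-variable:  494 ln(55.826) - 494 ln(494) + 10 ln(494) = -1015.04
10-variable: 494 ln(55.085) - 494 ln(494) + 11 ln(494) = -1015.44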

Because the 10-variable model has a lower AIC and BIC, we can conclude it fits the data better than the 9-variable model.
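The AIC and BIC values above can be reproduced with a short script. This is a minimal sketch that assumes p counts the intercept along with the predictors (10 and 11), which is what the 2(10) and 2(11) terms in the calculation imply. The function names are illustrative.

```python
import math


def aic(sse, n, p):
    """AIC in the form n*ln(SSE/n) + 2p."""
    return n * math.log(sse / n) + 2 * p


def bic(sse, n, p):
    """BIC in the form n*ln(SSE/n) + p*ln(n)."""
    return n * math.log(sse / n) + p * math.log(n)


n = 494
# (label, SSE, number of coefficients including the intercept)
models = [("9-variable", 55.826, 10), ("10-variable", 55.085, 11)]

results = {}
for label, sse, p in models:
    results[label] = (aic(sse, n, p), bic(sse, n, p))
    print(f"{label}: AIC = {results[label][0]:.2f}, BIC = {results[label][1]:.2f}")
```

The AIC gap between the two models is about 4.6, while the BIC gap is only about 0.4, so BIC favors the 10-variable model by a much smaller margin.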

I wanted to double-check that the 10-variable model, with all the predictors except ZN, INDUS, and AGE, was the best, so I ran a backward elimination in Minitab.

Once again it led to the 10-variable model being superior, with AGE eliminated first, followed by INDUS and finally ZN. I then went back and looked at the VIF values, which were all under 10, suggesting there are no serious problems with multicollinearity. I also looked at normal probability plots and residual plots. Keeping in mind that this isn't a perfect data set, the assumptions weren't seriously violated and the residuals seemed to be fairly normal. For example:

(NOX)

(B)

*Outliers were already deleted from the original data set.

Conclusion: After sorting and sifting through the data and carefully analyzing it, I've concluded the best model contains the 10 variables CRIM, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO, B, and log_LSTAT. The square root of the median value of owner-occupied homes in $1000s (Sqrt_MEDV) can best be predicted by the equation:

sqrt_MEDV = 6.55 - 0.0169 CRIM + 0.162 CHAS - 1.05 NOX + 0.340 RM - 0.0900 DIS
+ 0.0236 RAD - 0.00115 TAX - 0.0801 PTRATIO + 0.00127 B - 0.762 log_LSTAT
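
To show how the fitted equation is applied, here is a minimal Python sketch that evaluates it for one tract. The coefficients are copied from the equation above; the example tract is hypothetical and invented for illustration, and the function name is an assumption. Squaring the result to get back to $1000s is a simple back-transformation, not something the report itself carries out.

```python
# Coefficients copied from the fitted equation in the report.
INTERCEPT = 6.55
COEFFICIENTS = {
    "CRIM": -0.0169,
    "CHAS": 0.162,
    "NOX": -1.05,
    "RM": 0.340,
    "DIS": -0.0900,
    "RAD": 0.0236,
    "TAX": -0.00115,
    "PTRATIO": -0.0801,
    "B": 0.00127,
    "log_LSTAT": -0.762,
}


def predict_sqrt_medv(tract):
    """Evaluate the fitted equation for one tract (dict of predictor values)."""
    missing = set(COEFFICIENTS) - set(tract)
    if missing:
        raise ValueError(f"missing predictors: {sorted(missing)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * tract[name] for name in COEFFICIENTS)


# Hypothetical tract, invented for illustration only (not from the data set).
example = {
    "CRIM": 0.5, "CHAS": 0, "NOX": 0.5, "RM": 6.0, "DIS": 4.0,
    "RAD": 5, "TAX": 300, "PTRATIO": 18.0, "B": 390.0,
    "log_LSTAT": 2.3,  # natural log of roughly 10% lower status
}

sqrt_medv = predict_sqrt_medv(example)
print(f"Predicted Sqrt_MEDV: {sqrt_medv:.3f}")
print(f"Implied MEDV ($1000s): {sqrt_medv ** 2:.2f}")
```

Note that log_LSTAT must be supplied already on the natural-log scale, since that is how the predictor enters the model.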

I then found the average value of each predictor and plugged these averages in as a new observation to find a 95% confidence interval and a 95% prediction interval.

So we can say with 95% confidence that the mean Sqrt_MEDV at these average predictor values falls between 4.6134 and 4.6732, while we're 95% confident that a new observation with these predictor values will fall between 3.9791 and 5.3076.
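
Because these intervals are on the square-root scale, it can help to express them in $1000s. The sketch below squares the endpoints, which preserves their order since all of them are positive. This is an illustrative back-transformation and not part of the original Minitab output: for the prediction interval it carries over directly, while for the confidence interval it gives a range for the square of the mean Sqrt_MEDV, which is not exactly the mean of MEDV.

```python
# Interval endpoints on the Sqrt_MEDV scale, copied from the report.
confidence_interval = (4.6134, 4.6732)
prediction_interval = (3.9791, 5.3076)


def square_interval(interval):
    """Square both endpoints; order is preserved for non-negative endpoints."""
    low, high = interval
    if low < 0:
        raise ValueError("endpoints must be non-negative to preserve order")
    return low ** 2, high ** 2


ci_medv = square_interval(confidence_interval)
pi_medv = square_interval(prediction_interval)

print(f"CI, squared ($1000s): {ci_medv[0]:.2f} to {ci_medv[1]:.2f}")
print(f"PI, squared ($1000s): {pi_medv[0]:.2f} to {pi_medv[1]:.2f}")
```

For the prediction interval this works out to roughly 15.8 to 28.2, that is, about $15,800 to $28,200 for a tract with average predictor values.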

In conclusion, it takes 10 variables to predict the square root of the median value of owner-occupied homes:

1. CRIM: per capita crime rate by town
2. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
3. NOX: nitric oxides concentration (parts per 10 million)
4. RM: average number of rooms per dwelling
5. DIS: weighted distances to five Boston employment centres
6. RAD: index of accessibility to radial highways
7. TAX: full-value property-tax rate per $10,000
8. PTRATIO: pupil-teacher ratio by town
9. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
10. Log_LSTAT: natural log of % lower status of the population
