05/04/2017 FinalProjectRegressionModels

Final Project Regression Models
1 Background
As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict
the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess
whether the asking price of a house is higher or lower than the true value of the house. If the home is
undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages
In order to better assess the quality of the model you will produce, the data have been randomly divided into
three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load
the training data set; the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages.

library(statsr)
library(dplyr)
library(BAS)
library(gridExtra) # package to plot 2 ggplots at once
library(ggplot2)
library(corrplot)
library(caret)

2.1 Part 1 Exploratory Data Analysis (EDA)
When you first get your data, it's very tempting to immediately begin fitting models and assessing how they
perform. However, before you begin modeling, it's absolutely essential to explore the structure of the data
and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships
between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder
about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering
the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during
your EDA process, and briefly explain what you learned from each (why you found each informative).

2.1.1 a. Price or log(price)?
First, I evaluated the distribution of price in order to decide whether it needed some kind of transformation. The
two graphs below show that log(price) is much closer to a normal distribution than price on its original scale,
which is right-skewed. After analyzing this plot, I decided to model log(price).
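As a rough illustration of why the log scale helps, the sketch below simulates right-skewed prices and compares a simple moment-based skewness on both scales. The `sim_price` vector, its lognormal parameters and the `skewness` helper are assumptions made for this example; they are not the Ames data or part of the original analysis.

```r
# Simulated, right-skewed "prices" (hypothetical data, not ames_train)
set.seed(42)
sim_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: close to 0 for a symmetric sample
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(sim_price)       # positive: long right tail
skew_log <- skewness(log(sim_price))  # much closer to zero
c(raw = skew_raw, log = skew_log)
```

A skewness near zero on the log scale is one quick numeric check that complements the histograms.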

summary(ames_train$price)


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12790  129800  159500  181200  213000  615000

summary(log(ames_train$price)+1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.46   12.77   12.98   13.02   13.27   14.33

# histograms of price on the original scale and on the (shifted) log scale
plot1 <- ggplot(ames_train, aes(price)) + geom_histogram() + ggtitle('Distribution of price')
plot2 <- ggplot(ames_train, aes(log(price) + 1)) + geom_histogram() + ggtitle('Distribution of log(price)')
grid.arrange(plot1, plot2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1.2 b. Numerical Correlations
Next, we will check the correlations among the numeric variables and their correlation with the
dependent variable log(price). First, we check which columns have NAs and the proportion of NAs in each.


# select the integer columns
integer.cols.index <- which(sapply(ames_train, class) == 'integer')
ames_train.numeric <- ames_train[, integer.cols.index]

# proportion of NA values in each column
NApercentage <- sapply(ames_train.numeric, function(col) {
  NAcount <- sum(is.na(col))
  if (NAcount == 0) { result <- 0 } else {
    result <- NAcount / length(col)
  }
  invisible(result)
})
perc.missing <- NApercentage[NApercentage != 0]
perc.missing

##   Lot.Frontage   Mas.Vnr.Area   BsmtFin.SF.1   BsmtFin.SF.2    Bsmt.Unf.SF
##          0.167          0.007          0.001          0.001          0.001
##  Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath  Garage.Yr.Blt    Garage.Cars
##          0.001          0.001          0.001          0.048          0.001
##    Garage.Area
##          0.001

I see there is no need to remove a predictor variable, as the column with the most NAs has only about 17% missing values. So,
in order to create a numerical correlation matrix, I decided to remove the rows that had any NAs in them. The
procedure removed 219 rows. There are multiple approaches in the literature for dealing with
missing values, and removing incomplete rows is one of the most basic and least satisfactory techniques, but as a first
basic analysis of the data set, we will do it anyway.
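The same complete-case idea can be sketched compactly with base R helpers. The `toy` data frame below is made up for illustration; only the column names echo the Ames data.

```r
# Toy data frame with scattered NAs (hypothetical values)
toy <- data.frame(
  Lot.Frontage = c(60, NA, 80, 75, NA, 68),
  Garage.Area  = c(400, 520, NA, 300, 610, 480),
  price        = c(150000, 182000, 210000, 135000, 240000, 175000)
)

# Share of missing values per column (mean of a logical = proportion of TRUE)
na_share <- colMeans(is.na(toy))

# Keep only the rows without any NA
toy_complete <- toy[complete.cases(toy), ]
rows_removed <- nrow(toy) - nrow(toy_complete)

na_share
rows_removed
```

`complete.cases()` returns a logical vector, so it avoids the edge case where negative indexing with an empty index vector would drop every row.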

# indices of the rows that contain at least one NA
na.rows <- which(apply(ames_train.numeric, 1, function(row) {
  any(is.na(row))
}))
ames_train.numeric <- ames_train.numeric[-na.rows, ] # drop those rows
ames_train.numeric$log.price <- log(ames_train.numeric$price)

corplot <- corrplot(cor(ames_train.numeric), title = 'Correlation between numerical variables')


Finally, I selected only the correlations of each variable with log.price and kept only the variables whose absolute
correlation was greater than 0.4.

# drop the raw price row and column, keeping log.price
corplot <- corplot[-which(rownames(corplot) == 'price'), -which(colnames(corplot) == 'price')]
price.corr <- corplot['log.price', ]
corplot.df <- data.frame(price.corr = price.corr, var = names(price.corr))

# order the variables by decreasing correlation for plotting
reorderedLevels <- corplot.df$var[order(corplot.df$price.corr, decreasing = TRUE)]
corplot.df$var <- factor(corplot.df$var, levels = reorderedLevels)
corplot.df.final <- corplot.df %>% filter(abs(price.corr) > 0.4)

ggplot(corplot.df.final, aes(var, price.corr)) +
  geom_bar(stat = 'identity') +
  ggtitle('Numeric variable correlation with log(price)') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))


2.1.3 c. Categorical Correlations
First, I selected only the categorical variables from the original data set, then I removed the columns that had
more than 80% of their values missing. Before calculating each variable's ability to describe log(price), I
checked whether there were any near-zero-variance variables, i.e., variables whose values are almost all
concentrated in one of their possible categories. If you check the Utilities variable, you
will see that all of its values are concentrated in one category, AllPub. The next step was to predict
log(price) using each categorical predictor on its own, in order to observe which variable could better explain the
variation in log(price). The fit of each regression was measured by the adjusted \(R^2\), and the variables with
the highest fit were plotted.


# Get categorical (factor) variables
categ.cols.index <- which(sapply(ames_train, class) == 'factor')
ames_train.categ <- ames_train[, categ.cols.index]

# remove columns that have more than 80% of NA's
perc80.na.cols <- which(apply(ames_train.categ, 2, function(col) {
  sum(is.na(col)) / nrow(ames_train.categ) > 0.8
}))
ames_train.categ <- ames_train.categ[, -perc80.na.cols]

# e.g. Utilities was near-zero variance
nzvVars <- nzv(ames_train.categ)
ames_train.categ <- ames_train.categ[, -nzvVars] # remove nzv columns

# association with the price variable: adjusted R^2 of a one-predictor model
rScores <- sapply(ames_train.categ,
                  function(col) {
                    df <- data.frame(log.price = log(ames_train$price), col = col)
                    model <- lm(log.price ~ col, df)
                    summary(model)$adj.r.squared
                  })

catcorr.df <- data.frame(price.corr = rScores, var = names(rScores))
reorderedLevels <- catcorr.df$var[order(catcorr.df$price.corr, decreasing = TRUE)]
catcorr.df$var <- factor(catcorr.df$var, levels = reorderedLevels)
catcorr.df <- catcorr.df %>% filter(abs(price.corr) > 0.4)

ggplot(catcorr.df, aes(var, price.corr)) +
  geom_bar(stat = 'identity') +
  ggtitle('R^2 of each categorical variable when explaining log(price)') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))


2.2 Part 2 Development and assessment of an initial model, following a semi-guided process of analysis
2.2.1 Section 2.1 An Initial Model
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results
of the exploratory data analysis. (Note: The goal at this stage is not to identify the best possible model but
rather to choose a reasonable and understandable starting point. Later you will expand and revise this
model to create your final model.)

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for
price (or a transformed version of price) using those variables. Provide the R code and the summary
output table for your model, a brief justification for the variables you have chosen, and a brief discussion of
the model results in context (focused on the variables that appear to be important predictors and how they
relate to sales price).

Because I think numerical variables are easier to work with than categorical ones, I decided to fit the initial model
using the 10 numerical variables most correlated with log(price). To have a better interpretation of the
coefficients, including the intercept term, I first centered the predictor variables so we can interpret the
betas as variations from the mean predicted value of log(price).


# work with centered variables, except for price
ames_train.numeric.centered <- as.data.frame(scale(ames_train.numeric, center = TRUE,
                                                   scale = FALSE))
ames_train.numeric.centered$price <- ames_train.numeric$price

modelq2Log.centered <- lm(log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                            Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                            Total.Bsmt.SF + area + Overall.Qual,
                          data = ames_train.numeric.centered)
summary(modelq2Log.centered)

##
## Call:
## lm(formula = log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
##     Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF + Total.Bsmt.SF +
##     area + Overall.Qual, data = ames_train.numeric.centered)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.80632 -0.07930  0.01032  0.09198  0.46932
##
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)
## (Intercept)    1.204e+01  6.217e-03 1935.833  < 2e-16 ***
## Mas.Vnr.Area   8.457e-07  3.944e-05    0.021  0.98290
## Garage.Yr.Blt  1.722e-04  5.444e-04    0.316  0.75181
## Year.Remod.Add 2.495e-03  4.244e-04    5.879 6.16e-09 ***
## Year.Built     1.861e-03  4.588e-04    4.056 5.50e-05 ***
## Garage.Area    8.081e-05  6.747e-05    1.198  0.23140
## Garage.Cars    1.527e-02  2.018e-02    0.757  0.44944
## X1st.Flr.SF    8.500e-05  3.233e-05    2.629  0.00873 **
## Total.Bsmt.SF  1.368e-04  2.862e-05    4.780 2.10e-06 ***
## area           2.260e-04  1.859e-05   12.161  < 2e-16 ***
## Overall.Qual   1.075e-01  7.405e-03   14.524  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1737 on 770 degrees of freedom
## Multiple R-squared:  0.8363, Adjusted R-squared:  0.8342
## F-statistic: 393.5 on 10 and 770 DF,  p-value: < 2.2e-16

From the model summary, we can see that some variables carry more weight than others. As we
are working with log(price) and centered predictors (http://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/), we
can make the following interpretations:

* Intercept: The value exp(\(\beta_{0}\)) = exp(1.204e+01) = 169396.9 can be considered the unconditional geometric mean of the price Y (i.e., a measure of center that does not depend on any predictor variable X).
* Year.Remod.Add and Year.Built: The year the house was built and the year it was remodeled clearly affect the price. Keeping all other predictors constant, each one-year deviation from the mean of these predictors changes the price by (exp(\(\beta_{i}\)) - 1) * 100, i.e. about 0.25% and 0.19% respectively.
* X1st.Flr.SF: A small but significant predictor; the size of the first floor in square feet proved to be an important variable for modelling the price of a house. Holding all other variables constant, a one-unit change in X1st.Flr.SF from the mean changes the price of the house by about 0.0085%. This might not seem like much but, considering that this centered variable ranges over [-757.90, 1973.00], it can move the price by roughly -6% to +18%.
* Total.Bsmt.SF and area follow the same principle: keeping everything else constant, each additional square foot changes the price by about 0.014% and 0.023% respectively (roughly 1.4% and 2.3% per 100 square feet).
* Lastly, I reserved a separate item for the Overall.Qual coefficient. In this centered (but not standardized, so we cannot infer coefficient importance) log regression, this was the largest coefficient. The centered variable ranges over [-4.1680, 3.8320] and the coefficient is interpreted as (exp(\(\beta_{Overall.Qual}\)) - 1) * 100 = 11.35% (each unit change moves the predicted price by about 11.35%). Because effects compound multiplicatively on the price scale, this variable alone can move the final predicted price by roughly -36% to +51%.

In summary, it seems that real estate professionals focus mostly on the size of the house, its age, and a general
quality measure which, in my opinion, gives an average score for all the items and luxuries a house could
have.
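The percentage interpretation of a log-linear coefficient can be checked numerically. The sketch below reuses estimates from the summary table above; the helper name `pct_change` is introduced here for illustration. Because the response is log(price), a change of several units acts multiplicatively, so its effect is exp(beta * delta) - 1 rather than delta times the one-unit percentage.

```r
# Coefficients copied from the summary output of the initial log(price) model
beta <- c(Year.Remod.Add = 2.495e-03, Year.Built = 1.861e-03,
          X1st.Flr.SF = 8.500e-05, Total.Bsmt.SF = 1.368e-04,
          area = 2.260e-04, Overall.Qual = 1.075e-01)

# Percent change in predicted price for a change of `delta` units in one predictor
pct_change <- function(b, delta = 1) (exp(b * delta) - 1) * 100

per_unit    <- pct_change(beta)                                        # one-unit increase
per_100sqft <- pct_change(beta[c("Total.Bsmt.SF", "area")], delta = 100)

# Effect of Overall.Qual at the two ends of its centered range
qual_range <- pct_change(beta[["Overall.Qual"]], delta = c(-4.168, 3.832))

round(per_unit, 4)
round(per_100sqft, 2)
round(qual_range, 1)
```

For small coefficients the one-unit percentage is almost identical to 100 * beta, which is why the linear shortcut only becomes misleading over wide ranges such as that of Overall.Qual.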

2.2.2 Section 2.2 Model Selection
Now either using BAS or another stepwise selection procedure choose the best model you can, using your
initial model as your starting point. Try at least two different model selection methods and compare their
results. Do they both arrive at the same model or do they disagree? What do you think this means?

For model selection, I tried stepwise variable selection with both the AIC and BIC criteria, and both arrived at
the same model, which can be seen below. As both gave the same result, and therefore the same fit to the
training data, I chose to continue the modelling procedure with the AIC model (a personal choice; either the
AIC or the BIC model could be used).

# stepwise
null <- lm(log(price) ~ 1, data = ames_train.numeric.centered)
upperModel <- lm(log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                   Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                   Total.Bsmt.SF + area + Overall.Qual,
                 data = ames_train.numeric.centered)

## Mas.Vnr.Area, Garage.Yr.Blt and Garage.Cars were left out of both models
aic <- step(null, scope = list(upper = upperModel),
            data = ames_train.numeric.centered, direction = 'both')


## Start:  AIC=-1329.34
## log(price) ~ 1
##
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    99.421  42.593 -2267.8
## + area            1    73.808  68.205 -1900.1
## + Total.Bsmt.SF   1    66.370  75.643 -1819.3
## + Garage.Cars     1    65.607  76.407 -1811.4
## + X1st.Flr.SF     1    64.510  77.504 -1800.3
## + Garage.Area     1    58.203  83.810 -1739.2
## + Year.Built      1    57.412  84.601 -1731.9
## + Year.Remod.Add  1    55.435  86.579 -1713.8
## + Garage.Yr.Blt   1    53.613  88.400 -1697.6
## + Mas.Vnr.Area    1    32.849 109.164 -1532.8
## <none>                        142.013 -1329.3
##
## Step:  AIC=-2267.85
## log(price) ~ Overall.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.291  33.301 -2458.0
## + area            1     9.282  33.311 -2457.8
## + Total.Bsmt.SF   1     7.991  34.601 -2428.1
## + Garage.Cars     1     6.096  36.497 -2386.5
## + Garage.Area     1     6.030  36.563 -2385.1
## + Year.Remod.Add  1     3.484  39.109 -2332.5
## + Garage.Yr.Blt   1     3.063  39.530 -2324.1
## + Year.Built      1     2.957  39.636 -2322.0
## + Mas.Vnr.Area    1     1.473  41.119 -2293.3
## <none>                         42.593 -2267.8
## - Overall.Qual    1    99.421 142.013 -1329.3
##
## Step:  AIC=-2458.04
## log(price) ~ Overall.Qual + X1st.Flr.SF
##
##                  Df Sum of Sq     RSS     AIC
## + area            1     4.240  29.061 -2562.4
## + Year.Remod.Add  1     3.297  30.004 -2537.5
## + Garage.Cars     1     2.517  30.784 -2517.4
## + Garage.Yr.Blt   1     2.513  30.788 -2517.3
## + Year.Built      1     2.480  30.822 -2516.5
## + Garage.Area     1     1.918  31.383 -2502.4
## + Total.Bsmt.SF   1     0.543  32.758 -2468.9
## + Mas.Vnr.Area    1     0.213  33.088 -2461.1
## <none>                         33.301 -2458.0
## - X1st.Flr.SF     1     9.291  42.593 -2267.8
## - Overall.Qual    1    44.202  77.504 -1800.3
##
## Step:  AIC=-2562.41
## log(price) ~ Overall.Qual + X1st.Flr.SF + area
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    3.8583  25.203 -2671.7
## + Garage.Yr.Blt   1    3.1420  25.919 -2649.8
## + Year.Remod.Add  1    3.0715  25.990 -2647.7
## + Garage.Cars     1    1.4694  27.592 -2600.9
## + Garage.Area     1    1.2605  27.801 -2595.0
## + Total.Bsmt.SF   1    1.1796  27.882 -2592.8
## <none>                         29.061 -2562.4
## + Mas.Vnr.Area    1    0.0309  29.030 -2561.2
## - area            1    4.2400  33.301 -2458.0
## - X1st.Flr.SF     1    4.2494  33.311 -2457.8
## - Overall.Qual    1   25.4625  54.524 -2073.0
##
## Step:  AIC=-2671.66
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    0.9897  24.213 -2700.9
## + Total.Bsmt.SF   1    0.5883  24.615 -2688.1
## + Garage.Area     1    0.3165  24.886 -2679.5
## + Garage.Cars     1    0.2967  24.906 -2678.9
## + Garage.Yr.Blt   1    0.1158  25.087 -2673.2
## <none>                         25.203 -2671.7
## + Mas.Vnr.Area    1    0.0005  25.202 -2669.7
## - X1st.Flr.SF     1    3.3530  28.556 -2576.1
## - Year.Built      1    3.8583  29.061 -2562.4
## - area            1    5.6188  30.822 -2516.5
## - Overall.Qual    1    9.7113  34.914 -2419.1
##
## Step:  AIC=-2700.94
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add
##
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.7465  23.467 -2723.4
## + Garage.Area     1    0.2621  23.951 -2707.4
## + Garage.Cars     1    0.2114  24.002 -2705.8
## <none>                         24.213 -2700.9
## + Garage.Yr.Blt   1    0.0235  24.190 -2699.7
## + Mas.Vnr.Area    1    0.0044  24.209 -2699.1
## - Year.Remod.Add  1    0.9897  25.203 -2671.7
## - Year.Built      1    1.7765  25.990 -2647.7
## - X1st.Flr.SF     1    3.5271  27.740 -2596.7
## - area            1    5.0018  29.215 -2556.3
## - Overall.Qual    1    7.9512  32.164 -2481.2
##
## Step:  AIC=-2723.4
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.2032  23.264 -2728.2
## + Garage.Cars     1    0.1794  23.287 -2727.4
## <none>                         23.467 -2723.4
## + Garage.Yr.Blt   1    0.0180  23.449 -2722.0
## + Mas.Vnr.Area    1    0.0008  23.466 -2721.4
## - X1st.Flr.SF     1    0.2878  23.755 -2715.9
## - Total.Bsmt.SF   1    0.7465  24.213 -2700.9
## - Year.Remod.Add  1    1.1479  24.614 -2688.1
## - Year.Built      1    1.3059  24.773 -2683.1
## - area            1    5.4072  28.874 -2563.4
## - Overall.Qual    1    6.7150  30.182 -2528.9
##
## Step:  AIC=-2728.19
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF + Garage.Area
##
##                  Df Sum of Sq     RSS     AIC
## <none>                         23.264 -2728.2
## + Garage.Cars     1    0.0168  23.247 -2726.8
## + Garage.Yr.Blt   1    0.0025  23.261 -2726.3
## + Mas.Vnr.Area    1    0.0000  23.264 -2726.2
## - Garage.Area     1    0.2032  23.467 -2723.4
## - X1st.Flr.SF     1    0.2138  23.477 -2723.1
## - Total.Bsmt.SF   1    0.6875  23.951 -2707.4
## - Year.Built      1    1.0109  24.274 -2697.0
## - Year.Remod.Add  1    1.0885  24.352 -2694.5
## - area            1    4.8307  28.094 -2582.8
## - Overall.Qual    1    6.5842  29.848 -2535.6

# k = log(n) turns the AIC penalty into the BIC penalty
bic <- step(null, scope = list(upper = upperModel), data = ames_train.numeric.centered,
            direction = 'both',
            k = log(nrow(ames_train.numeric)))


## Start:  AIC=-1324.67
## log(price) ~ 1
##
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    99.421  42.593 -2258.5
## + area            1    73.808  68.205 -1890.8
## + Total.Bsmt.SF   1    66.370  75.643 -1810.0
## + Garage.Cars     1    65.607  76.407 -1802.1
## + X1st.Flr.SF     1    64.510  77.504 -1791.0
## + Garage.Area     1    58.203  83.810 -1729.9
## + Year.Built      1    57.412  84.601 -1722.5
## + Year.Remod.Add  1    55.435  86.579 -1704.5
## + Garage.Yr.Blt   1    53.613  88.400 -1688.2
## + Mas.Vnr.Area    1    32.849 109.164 -1523.5
## <none>                        142.013 -1324.7
##
## Step:  AIC=-2258.53
## log(price) ~ Overall.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.291  33.301 -2444.1
## + area            1     9.282  33.311 -2443.8
## + Total.Bsmt.SF   1     7.991  34.601 -2414.2
## + Garage.Cars     1     6.096  36.497 -2372.5
## + Garage.Area     1     6.030  36.563 -2371.1
## + Year.Remod.Add  1     3.484  39.109 -2318.5
## + Garage.Yr.Blt   1     3.063  39.530 -2310.2
## + Year.Built      1     2.957  39.636 -2308.1
## + Mas.Vnr.Area    1     1.473  41.119 -2279.4
## <none>                         42.593 -2258.5
## - Overall.Qual    1    99.421 142.013 -1324.7
##
## Step:  AIC=-2444.06
## log(price) ~ Overall.Qual + X1st.Flr.SF
##
##                  Df Sum of Sq     RSS     AIC
## + area            1     4.240  29.061 -2543.8
## + Year.Remod.Add  1     3.297  30.004 -2518.8
## + Garage.Cars     1     2.517  30.784 -2498.8
## + Garage.Yr.Blt   1     2.513  30.788 -2498.7
## + Year.Built      1     2.480  30.822 -2497.8
## + Garage.Area     1     1.918  31.383 -2483.7
## + Total.Bsmt.SF   1     0.543  32.758 -2450.2
## <none>                         33.301 -2444.1
## + Mas.Vnr.Area    1     0.213  33.088 -2442.4
## - X1st.Flr.SF     1     9.291  42.593 -2258.5
## - Overall.Qual    1    44.202  77.504 -1791.0
##
## Step:  AIC=-2543.76
## log(price) ~ Overall.Qual + X1st.Flr.SF + area
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    3.8583  25.203 -2648.3
## + Garage.Yr.Blt   1    3.1420  25.919 -2626.5
## + Year.Remod.Add  1    3.0715  25.990 -2624.3
## + Garage.Cars     1    1.4694  27.592 -2577.6
## + Garage.Area     1    1.2605  27.801 -2571.7
## + Total.Bsmt.SF   1    1.1796  27.882 -2569.5
## <none>                         29.061 -2543.8
## + Mas.Vnr.Area    1    0.0309  29.030 -2537.9
## - area            1    4.2400  33.301 -2444.1
## - X1st.Flr.SF     1    4.2494  33.311 -2443.8
## - Overall.Qual    1   25.4625  54.524 -2059.0
##
## Step:  AIC=-2648.35
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    0.9897  24.213 -2673.0
## + Total.Bsmt.SF   1    0.5883  24.615 -2660.1
## + Garage.Area     1    0.3165  24.886 -2651.6
## + Garage.Cars     1    0.2967  24.906 -2650.9
## <none>                         25.203 -2648.3
## + Garage.Yr.Blt   1    0.1158  25.087 -2645.3
## + Mas.Vnr.Area    1    0.0005  25.202 -2641.7
## - X1st.Flr.SF     1    3.3530  28.556 -2557.5
## - Year.Built      1    3.8583  29.061 -2543.8
## - area            1    5.6188  30.822 -2497.8
## - Overall.Qual    1    9.7113  34.914 -2400.5
##
## Step:  AIC=-2672.98
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add
##
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.7465  23.467 -2690.8
## + Garage.Area     1    0.2621  23.951 -2674.8
## + Garage.Cars     1    0.2114  24.002 -2673.2
## <none>                         24.213 -2673.0
## + Garage.Yr.Blt   1    0.0235  24.190 -2667.1
## + Mas.Vnr.Area    1    0.0044  24.209 -2666.5
## - Year.Remod.Add  1    0.9897  25.203 -2648.3
## - Year.Built      1    1.7765  25.990 -2624.3
## - X1st.Flr.SF     1    3.5271  27.740 -2573.4
## - area            1    5.0018  29.215 -2533.0
## - Overall.Qual    1    7.9512  32.164 -2457.9
##
## Step:  AIC=-2690.78
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.2032  23.264 -2690.9
## <none>                         23.467 -2690.8
## + Garage.Cars     1    0.1794  23.287 -2690.1
## - X1st.Flr.SF     1    0.2878  23.755 -2687.9
## + Garage.Yr.Blt   1    0.0180  23.449 -2684.7
## + Mas.Vnr.Area    1    0.0008  23.466 -2684.1
## - Total.Bsmt.SF   1    0.7465  24.213 -2673.0
## - Year.Remod.Add  1    1.1479  24.614 -2660.1
## - Year.Built      1    1.3059  24.773 -2655.1
## - area            1    5.4072  28.874 -2535.5
## - Overall.Qual    1    6.7150  30.182 -2500.9
##
## Step:  AIC=-2690.91
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF + Garage.Area
##
##                  Df Sum of Sq     RSS     AIC
## <none>                         23.264 -2690.9
## - Garage.Area     1    0.2032  23.467 -2690.8
## - X1st.Flr.SF     1    0.2138  23.477 -2690.4
## + Garage.Cars     1    0.0168  23.247 -2684.8
## + Garage.Yr.Blt   1    0.0025  23.261 -2684.3
## + Mas.Vnr.Area    1    0.0000  23.264 -2684.2
## - Total.Bsmt.SF   1    0.6875  23.951 -2674.8
## - Year.Built      1    1.0109  24.274 -2664.3
## - Year.Remod.Add  1    1.0885  24.352 -2661.8
## - area            1    4.8307  28.094 -2550.2
## - Overall.Qual    1    6.5842  29.848 -2502.9

# print results
aic

##
## Call:
## lm(formula = log(price) ~ Overall.Qual + X1st.Flr.SF + area +
##     Year.Built + Year.Remod.Add + Total.Bsmt.SF + Garage.Area,
##     data = ames_train.numeric.centered)
##
## Coefficients:
##    (Intercept)    Overall.Qual     X1st.Flr.SF            area
##      1.204e+01       1.080e-01       8.559e-05       2.285e-04
##     Year.Built  Year.Remod.Add   Total.Bsmt.SF     Garage.Area
##      1.794e-03       2.493e-03       1.365e-04       1.117e-04

bic

##
## Call:
## lm(formula = log(price) ~ Overall.Qual + X1st.Flr.SF + area +
##     Year.Built + Year.Remod.Add + Total.Bsmt.SF + Garage.Area,
##     data = ames_train.numeric.centered)
##
## Coefficients:
##    (Intercept)    Overall.Qual     X1st.Flr.SF            area
##      1.204e+01       1.080e-01       8.559e-05       2.285e-04
##     Year.Built  Year.Remod.Add   Total.Bsmt.SF     Garage.Area
##      1.794e-03       2.493e-03       1.365e-04       1.117e-04
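The values that `step()` labels "AIC" in the two traces can be reproduced by hand, which makes clear that the only difference between the runs is the penalty `k`. The sketch below checks the intercept-only starting values; the sample size `n = 781` is an assumption inferred from the initial model (770 residual degrees of freedom plus 11 coefficients), and `ic` is a helper name introduced for this example.

```r
# Values read from the traces above; n is inferred, not stated in the report
n   <- 781      # assumption: 770 residual df + 11 coefficients in the initial model
rss <- 142.013  # residual sum of squares of the intercept-only model
p   <- 1        # number of estimated coefficients (intercept only)

# For lm fits, step() ranks models by n * log(RSS / n) + k * p
ic <- function(rss, n, p, k) n * log(rss / n) + k * p

aic_null <- ic(rss, n, p, k = 2)       # AIC penalty
bic_null <- ic(rss, n, p, k = log(n))  # BIC penalty, as in the second step() call
round(c(AIC = aic_null, BIC = bic_null), 2)
```

Since log(781) is about 6.66, each extra coefficient costs more than three times as much under BIC as under AIC, so BIC tends to favour smaller models; here both criteria still stop at the same seven predictors.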

2.2.3 Section 2.3 Initial Model Residuals
One way to assess the performance of a model is to examine the model's residuals. In the space below,
create a residual plot for your preferred model from above and use it to assess whether your model appears
to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly
discuss potential implications it may have for your model and inference/prediction you might produce.


From the residual plots, we can see that the residuals are approximately normally distributed and do not seem to have any
non-linear trend. Already in the first plot, we can see there might be three outliers (rows 251, 343 and 582) which
can negatively influence our least squares estimates. For now, I keep them as they are, but I will evaluate
them afterwards.

The normal Q-Q plot confirms that the residuals do not depart much from the ideal Gaussian
distribution.

Lastly, the Cook's distance plot shows that two of the three possible outliers have high leverage. I will evaluate
them shortly.
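Influence and leverage can also be extracted numerically rather than read off the diagnostic plots. The sketch below uses simulated data with one deliberately extreme point; the data and the object names (`fit`, `cd`, `lev`) are illustrative assumptions, not the Ames model.

```r
# Simulated data: a clean linear trend plus one far-away, off-trend point
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 1)
x <- c(x, 150)  # observation 51: extreme x value ...
y <- c(y, 0)    # ... with a response far below the trend

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)  # influence of each observation on the fitted coefficients
lev <- hatvalues(fit)       # leverage (diagonal of the hat matrix)

most_influential <- unname(which.max(cd))
most_influential
```

Sorting `cd` in decreasing order gives a shortlist of rows to inspect, in the same spirit as the three rows flagged above.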

par(mfrow=c(2,2))
plot(aic)

par(mfrow=c(1,1))

2.2.4 Section 2.4 Initial Model RMSE
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending
on whether you transformed your response variable). The value you report will be more meaningful if it is in
the original units (dollars).

The RMSE for the training data is 40297.18 dollars. The RMSE can be interpreted as the typical
deviation of the data from the predicted value. In this sense, on average, the model predicts the house price with a
deviation of about ±40297.18 dollars. Keep in mind that this measure is sensitive to outliers, such as the ones we have
seen; overall, however, it is a reasonably small RMSE, as price ranges over [12790, 615000].

# back-transform the fitted log(price) values to dollars before computing the error
RMSE <- sqrt(mean((ames_train.numeric.centered$price - exp(aic$fitted.values))^2))
RMSE

## [1] 40297.18

2.2.5 Section 2.5 Overfitting
The process of building a model generally involves starting with an initial model (as you have done above),
identifying its shortcomings, and adapting the model accordingly. This process may be repeated several
times until the model fits the data reasonably well. However, the model may do well on training data but
perform poorly out-of-sample (meaning, on a data set other than the original training data) because the
model is overly tuned to specifically fit the training data. This is called overfitting. To determine whether
overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample
data sets. To look at performance of your initial model on out-of-sample data, you will use the data
set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the
predictions significantly more accurate (compared to the actual sales prices) for the training data than the
test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

To predict on the test data, I first selected only the numerical variables again, in order to center them as I did
with the training data.

The RMSE for the test data was 58072.19 dollars, larger than the one we saw for the training
data set. This difference makes sense: the ordinary least squares estimators minimize the squared
distance of the points to the line (hyperplane) in the training data, so the training error is an optimistic estimate.

The residual plot shows a roughly normal distribution around zero, which is corroborated by the Q-Q plot. We can see
there is a possible outlier among the predictions but, as we are dealing with the test data set, there is nothing
we can do.
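One detail that affects this comparison is which means are used for centering. A common alternative to centering each data set with its own means is to reuse the training means on the test set, so that the fitted intercept keeps the same reference point. The sketch below illustrates that variant on simulated data; `make_data`, `rmse_dollars` and every number in it are assumptions for the example, not the Ames data or the preprocessing used in this report.

```r
set.seed(123)
# Simulated housing-like data (hypothetical; stands in for ames_train / ames_test)
make_data <- function(n) {
  area <- runif(n, 600, 3000)
  data.frame(area = area,
             price = exp(11 + 0.0005 * area + rnorm(n, sd = 0.15)))
}
train <- make_data(500)
test  <- make_data(200)

# Center both sets with the TRAINING mean
area_mean    <- mean(train$area)
train$area_c <- train$area - area_mean
test$area_c  <- test$area - area_mean

fit <- lm(log(price) ~ area_c, data = train)

# RMSE in dollars: back-transform predictions from the log scale before comparing
rmse_dollars <- function(model, data) {
  sqrt(mean((data$price - exp(predict(model, newdata = data)))^2))
}
rmse_train <- rmse_dollars(fit, train)
rmse_test  <- rmse_dollars(fit, test)
c(train = rmse_train, test = rmse_test)
```

Wrapping the metric in one function keeps the in-sample and out-of-sample numbers directly comparable.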

# preprocess data
ames_test.numeric.centered <- ames_test[, integer.cols.index]
ames_test.numeric.centered <- as.data.frame(scale(ames_test.numeric.centered, center = TRUE,
                                                  scale = FALSE))

predictions <- predict(aic, ames_test.numeric.centered)
# log rmse = 0.16
RMSEq6 <- sqrt(mean((ames_test$price - exp(predictions))^2))
RMSEq6

## [1] 58072.19


# residuals on the log scale: predicted minus observed log(price)
residualsq6 <- predictions - log(ames_test$price)

par(mfrow = c(1, 2))
plot(residualsq6, main = 'Residuals for test data')
qqnorm(residualsq6)
qqline(residualsq6)

par(mfrow=c(1,1))

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data
fits your model a lot better than the test data, it is probably a good idea to go back and revise the model
(usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the
assignment, however.

2.3 Part 3 Development of a Final Model
Now that you have developed an initial model to use as a baseline, create a final model with at most 20
variables to predict housing prices in Ames, IA, selecting from the full array of variables in the data set and
using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the
questions below.

2.3.1 Section 3.1 Final Model
2.3.1.1 Outliers Analysis

First, I analysed the three outliers in the data set in order to decide whether to remove them. After a first
inspection, I found some variables with values that might not be representative of the data (extreme values).

# rows 251, 343 and 582
rowsOfInterest <- c(251, 343, 582)
ames_train.numeric[rowsOfInterest, c('area', 'Year.Built', 'price')]

## # A tibble: 3 x 3
##    area Year.Built  price
##   <int>      <int>  <int>
## 1  4676       2007 184750
## 2   832       1923  12789
## 3  1317       1920  40000

First, for observation 251, we can see that it is the biggest house in the data set, and really far from
the second biggest, as we can see in the plot below. So, as the point is not very representative, I decided to
remove it, as it can have a large influence on the least squares fit.

max(ames_train.numeric$area)

## [1] 4676

plot(ames_train.numeric$area, ames_train.numeric$price)

# ^ decided to remove obs 251 ^


Second, points 343 and 582 came to my attention because they were really old houses, and really cheap
ones. In fact, other houses built in the same period have a mean price a lot higher than these two,
with observation 343 being the more extreme case. Observation 582 also had a low price but, as it is
closer to the mean price of houses of the same age, I decided not to remove it.

sum(ames_train.numeric$Year.Built <= 1923)

## [1] 65

summary(ames_train.numeric$price[ames_train.numeric$Year.Built < 1923])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   40000  105500  124800  124700  148400  266000

plot(ames_train.numeric$Year.Built[ames_train.numeric$Year.Built <= 1923],
     ames_train.numeric$price[ames_train.numeric$Year.Built <= 1923],
     main = 'Price Distribution by Year.Built',
     xlab = 'Year.Built',
     ylab = 'price')

# ^ decided to remove obs 343, but NOT remove 582 ^

2.3.1.2 Final model
Provide the summary table for your model.


Finally, I had to do two more transformations before running the regression. As I included factor (character)
variables, some insights (and problems) appeared.

I saw that the NA value in the basement quality variable meant no basement so, in order to use this information
in the regression, I transformed the NA values into a factor level.

When I included Neighborhood, I started to get a "not full rank matrix" error, which meant I had
linearly dependent columns among my predictors. Inspecting my data, I found that, when converting
the Neighborhood column to dummy variables, two neighborhoods had a count of 0 (GrnHill and
LandMrk) and that would give me two columns of only 0 (and problems!). So I merged these two
levels into a single one, GrnHill_LandMrk.

Now, time for the regression! I included the four categorical variables from the earlier analysis that alone explained
more than 0.5 as adjusted \(R^2\).
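The two factor manipulations described above can be sketched on a small vector. The `qual` values and the `"None"` label below are made up for illustration (the report itself uses the string 'NA' as the new level).

```r
# Hypothetical basement-quality column: NA means "no basement"
qual <- factor(c("Gd", "TA", NA, "Ex", "TA", NA),
               levels = c("Ex", "Gd", "TA", "Fa", "Po"))

# Turn NA into an explicit category so those rows are not dropped by lm()
qual_chr <- as.character(qual)
qual_chr[is.na(qual_chr)] <- "None"
qual_fixed <- relevel(factor(qual_chr), ref = "TA")  # "TA" becomes the baseline level

# Levels that never occur (here "Fa" and "Po") can be dropped explicitly
qual_used <- droplevels(qual)

levels(qual_fixed)
levels(qual_used)
```

Choosing the most common level as the reference, as `relevel()` does here with "TA", makes each dummy coefficient read as a difference from a typical house.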

### Remove early rows with NA and also outliers
ames_train.2 <- ames_train[-c(na.rows, 251, 343), ]

# rescale numerical values to zero mean, except for price
ames_train.2[, integer.cols.index] <- scale(ames_train.2[, integer.cols.index], center = TRUE, scale = FALSE)
ames_train.2$price <- ames_train[-c(na.rows, 251, 343), ]$price

# convert NA to a factor level
ames_train.2.Qual_sub <- as.character(ames_train.2$Bsmt.Qual)
ames_train.2.Qual_sub[is.na(ames_train.2.Qual_sub)] <- 'NA'
ames_train.2$Bsmt.Qual <- relevel(factor(ames_train.2.Qual_sub), ref = 'TA')

# create new factor level GrnHill_LandMrk and drop unused ones.
ames_train.2$Neighborhood <- droplevels(ames_train.2$Neighborhood)
ames_train.2$Neighborhood <- factor(ames_train.2$Neighborhood, levels =
                                      c(levels(ames_train.2$Neighborhood), 'GrnHill_LandMrk'))

testModRel2 <- lm(log(price) ~ Fireplaces + BsmtFin.SF.1 + TotRms.AbvGrd +
                    Full.Bath + Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                    Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                    Total.Bsmt.SF + area + Overall.Qual + Neighborhood +
                    Bsmt.Qual + Exter.Qual + Kitchen.Qual,
                  data = ames_train.2)

# prepare stepwise AIC model selection
null <- lm(log(price) ~ 1, data = ames_train.2)

# Exter.Qual, Garage.Yr.Blt, Garage.Cars, Mas.Vnr.Area, X1st.Flr.SF and Bsmt.Qual were left out
aic2 <- step(null, scope = list(upper = testModRel2), data = ames_train.2, direction = 'both')


## Start:  AIC=-1327.74
## log(price) ~ 1
##
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    98.739  42.581 -2260.2
## + Neighborhood   25    86.916  54.403 -2021.4
## + Bsmt.Qual       5    73.884  67.436 -1894.1
## + area            1    73.135  68.185 -1893.5
## + Exter.Qual      3    72.001  69.319 -1876.6
## + Kitchen.Qual    4    69.538  71.782 -1847.4
## + Total.Bsmt.SF   1    65.792  75.527 -1813.8
## + Garage.Cars     1    65.027  76.292 -1806.0
## + X1st.Flr.SF     1    64.008  77.312 -1795.6
## + Garage.Area     1    57.871  83.449 -1736.1
## + Year.Built      1    56.995  84.324 -1728.0
## + Year.Remod.Add  1    55.043  86.276 -1710.2
## + Garage.Yr.Blt   1    53.208  88.111 -1693.8
## + Full.Bath       1    49.295  92.025 -1659.9
## + TotRms.AbvGrd   1    39.422 101.898 -1580.5
## + Fireplaces      1    35.555 105.764 -1551.5
## + Mas.Vnr.Area    1    32.249 109.070 -1527.5
## + BsmtFin.SF.1    1    30.810 110.509 -1517.3
## <none>                        141.319 -1327.7
##
## Step:  AIC=-2260.25
## log(price) ~ Overall.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.295  33.285 -2450.1
## + area            1     9.273  33.308 -2449.6
## + Neighborhood   25    10.470  32.110 -2430.1
## + Total.Bsmt.SF   1     7.988  34.592 -2420.1
## + BsmtFin.SF.1    1     6.404  36.177 -2385.2
## + Garage.Cars     1     6.087  36.494 -2378.4
## + Garage.Area     1     6.033  36.547 -2377.3
## + TotRms.AbvGrd   1     5.096  37.485 -2357.6
## + Bsmt.Qual       5     4.088  38.493 -2328.9
## + Fireplaces      1     3.524  39.056 -2325.6
## + Year.Remod.Add  1     3.492  39.088 -2324.9
## + Kitchen.Qual    4     3.772  38.808 -2324.5
## + Garage.Yr.Blt   1     3.070  39.511 -2316.5
## + Year.Built      1     2.965  39.615 -2314.5
## + Full.Bath       1     2.803  39.777 -2311.3
## + Exter.Qual      3     2.689  39.892 -2305.1
## + Mas.Vnr.Area    1     1.462  41.119 -2285.5
## <none>                         42.581 -2260.2
## - Overall.Qual    1    98.739 141.319 -1327.7
##
## Step:  AIC=-2450.11
## log(price) ~ Overall.Qual + X1st.Flr.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Neighborhood   25     6.374  26.911 -2565.7
## + area            1     4.246  29.039 -2554.4
## + Year.Remod.Add  1     3.287  29.998 -2529.1
## + Garage.Cars     1     2.515  30.771 -2509.3
## + Garage.Yr.Blt   1     2.503  30.782 -2509.0
## + Year.Built      1     2.469  30.816 -2508.2
## + Bsmt.Qual       5     2.745  30.540 -2507.2
## + TotRms.AbvGrd   1     2.351  30.934 -2505.2
## + Kitchen.Qual    4     2.416  30.870 -2500.8
## + Garage.Area     1     1.919  31.366 -2494.4
## + BsmtFin.SF.1    1     1.800  31.486 -2491.4
## + Full.Bath       1     1.630  31.655 -2487.2
## + Exter.Qual      3     1.531  31.755 -2480.8
## + Fireplaces      1     1.183  32.102 -2476.3
## + Total.Bsmt.SF   1     0.546  32.739 -2461.0
## + Mas.Vnr.Area    1     0.222  33.063 -2453.3
## <none>                         33.285 -2450.1
## - X1st.Flr.SF     1     9.295  42.581 -2260.2
## - Overall.Qual    1    44.027  77.312 -1795.6
##
## Step:  AIC=-2565.7
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood
##
##                  Df Sum of Sq     RSS     AIC
## + area            1    4.8159  22.095 -2717.3
## + TotRms.AbvGrd   1    2.3905  24.521 -2636.2
## + Year.Remod.Add  1    1.7361  25.175 -2615.7
## + Kitchen.Qual    4    1.8209  25.090 -2612.3
## + BsmtFin.SF.1    1    1.2826  25.629 -2601.7
## + Garage.Cars     1    1.1249  25.786 -2597.0
## + Fireplaces      1    1.0996  25.812 -2596.2
## + Garage.Area     1    0.9719  25.939 -2592.3
## + Full.Bath       1    0.7726  26.139 -2586.4
## + Bsmt.Qual       5    0.9392  25.972 -2583.4
## + Garage.Yr.Blt   1    0.5640  26.347 -2580.2
## + Year.Built      1    0.3266  26.585 -2573.2
## + Exter.Qual      3    0.4113  26.500 -2571.7
## + Total.Bsmt.SF   1    0.2176  26.694 -2570.0
## + Mas.Vnr.Area    1    0.1143  26.797 -2567.0
## <none>                         26.911 -2565.7
## - Neighborhood   25    6.3739  33.285 -2450.1
## - X1st.Flr.SF     1    5.1989  32.110 -2430.1
## - Overall.Qual    1   12.8667  39.778 -2263.3
##
## Step:  AIC=-2717.3
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area
##
##                  Df Sum of Sq     RSS     AIC
## + BsmtFin.SF.1    1    1.9491  20.146 -2787.2
## + Year.Remod.Add  1    1.4024  20.693 -2766.4
## + Kitchen.Qual    4    1.2681  20.827 -2755.3
## + Bsmt.Qual       5    1.1264  20.969 -2748.1
## + Year.Built      1    0.8413  21.254 -2745.5
## + Total.Bsmt.SF   1    0.6682  21.427 -2739.2
## + Garage.Yr.Blt   1    0.6454  21.450 -2738.4
## + Garage.Area     1    0.4223  21.673 -2730.3
## + Exter.Qual      3    0.4843  21.611 -2728.6
## + Garage.Cars     1    0.3552  21.740 -2727.9
## + Fireplaces      1    0.2425  21.853 -2723.9
## <none>                         22.095 -2717.3
## + TotRms.AbvGrd   1    0.0121  22.083 -2715.7
## + Mas.Vnr.Area    1    0.0080  22.087 -2715.6
## + Full.Bath       1    0.0067  22.089 -2715.5
## - X1st.Flr.SF     1    1.5177  23.613 -2667.6
## - area            1    4.8159  26.911 -2565.7
## - Neighborhood   25    6.9434  29.039 -2554.4
## - Overall.Qual    1    6.6202  28.715 -2515.2
##
## Step:  AIC=-2787.24
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    1.3907  18.756 -2841.0
## + Kitchen.Qual    4    1.1209  19.025 -2823.8
## + Garage.Yr.Blt   1    0.5611  19.585 -2807.2
## + Year.Built      1    0.5426  19.604 -2806.5
## + Bsmt.Qual       5    0.5450  19.601 -2798.6
## + Garage.Area     1    0.3355  19.811 -2798.3
## + Exter.Qual      3    0.4072  19.739 -2797.1
## + Garage.Cars     1    0.2846  19.862 -2796.3
## + Total.Bsmt.SF   1    0.1952  19.951 -2792.8
## + Fireplaces      1    0.0832  20.063 -2788.5
## <none>                         20.146 -2787.2
## + TotRms.AbvGrd   1    0.0061  20.140 -2785.5
## + Mas.Vnr.Area    1    0.0011  20.145 -2785.3
## + Full.Bath       1    0.0001  20.146 -2785.2
## - X1st.Flr.SF     1    0.3611  20.507 -2775.4
## - BsmtFin.SF.1    1    1.9491  22.095 -2717.3
## - Neighborhood   25    6.2932  26.439 -2625.5
## - area            1    5.4823  25.629 -2601.7
## - Overall.Qual    1    6.2572  26.403 -2578.5
##
## Step:  AIC=-2840.96
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add
##
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.3127  18.443 -2852.1
## + Kitchen.Qual    4    0.4336  18.322 -2851.2
## + Total.Bsmt.SF   1    0.2775  18.478 -2850.6
## + Year.Built      1    0.2556  18.500 -2849.7
## + Garage.Yr.Blt   1    0.2459  18.510 -2849.2
## + Bsmt.Qual       5    0.4272  18.328 -2848.9
## + Garage.Cars     1    0.2274  18.528 -2848.5
## + Fireplaces      1    0.1736  18.582 -2846.2
## + Exter.Qual      3    0.1537  18.602 -2841.4
## <none>                         18.756 -2841.0
## + TotRms.AbvGrd   1    0.0267  18.729 -2840.1
## + Full.Bath       1    0.0153  18.740 -2839.6
## + Mas.Vnr.Area    1    0.0000  18.756 -2839.0
## - X1st.Flr.SF     1    0.3627  19.118 -2828.0
## - Year.Remod.Add  1    1.3907  20.146 -2787.2
## - BsmtFin.SF.1    1    1.9373  20.693 -2766.4
## - Neighborhood   25    4.5251  23.281 -2722.6
## - Overall.Qual    1    4.8659  23.622 -2663.3
## - area            1    5.1261  23.882 -2654.7
##
## Step:  AIC=-2852.06
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area
##
##                  Df Sum of Sq     RSS     AIC
## + Kitchen.Qual    4    0.4108  18.032 -2861.6
## + Total.Bsmt.SF   1    0.2459  18.197 -2860.5
## + Bsmt.Qual       5    0.3952  18.048 -2858.9
## + Year.Built      1    0.1918  18.251 -2858.2
## + Fireplaces      1    0.1888  18.254 -2858.1
## + Garage.Yr.Blt   1    0.0898  18.353 -2853.9
## <none>                         18.443 -2852.1
## + Exter.Qual      3    0.1399  18.303 -2852.0
## + TotRms.AbvGrd   1    0.0291  18.414 -2851.3
## + Full.Bath       1    0.0132  18.430 -2850.6
## + Garage.Cars     1    0.0069  18.436 -2850.3
## + Mas.Vnr.Area    1    0.0009  18.442 -2850.1
## - X1st.Flr.SF     1    0.1907  18.634 -2846.0
## - Garage.Area     1    0.3127  18.756 -2841.0
## - Year.Remod.Add  1    1.3678  19.811 -2798.3
## - BsmtFin.SF.1    1    1.8537  20.297 -2779.4
## - Neighborhood   25    4.3054  22.748 -2738.6
## - area            1    4.6015  23.044 -2680.5
## - Overall.Qual    1    4.7691  23.212 -2674.9
##
## Step:  AIC=-2861.61
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.1867  17.845 -2867.7
## + Fireplaces      1    0.1767  17.855 -2867.3
## + Bsmt.Qual       5    0.3513  17.681 -2866.9
## + Year.Built      1    0.1008  17.931 -2864.0
## + Garage.Yr.Blt   1    0.0530  17.979 -2861.9
## + TotRms.AbvGrd   1    0.0463  17.986 -2861.6
## <none>                         18.032 -2861.6
## + Full.Bath       1    0.0256  18.006 -2860.7
## + Exter.Qual      3    0.1051  17.927 -2860.2
## + Garage.Cars     1    0.0050  18.027 -2859.8
## + Mas.Vnr.Area    1    0.0002  18.032 -2859.6
## - X1st.Flr.SF     1    0.1858  18.218 -2855.6
## - Kitchen.Qual    4    0.4108  18.443 -2852.1
## - Garage.Area     1    0.2899  18.322 -2851.2
## - Year.Remod.Add  1    0.6912  18.723 -2834.3
## - BsmtFin.SF.1    1    1.7405  19.773 -2791.8
## - Neighborhood   25    4.3771  22.409 -2742.3
## - Overall.Qual    1    3.9468  21.979 -2709.4
## - area            1    4.3518  22.384 -2695.2
##
## Step:  AIC=-2867.71
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual +
##     Total.Bsmt.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Fireplaces      1    0.2250  17.620 -2875.6
## - X1st.Flr.SF     1    0.0011  17.846 -2869.7
## + Year.Built      1    0.0682  17.777 -2868.7
## + TotRms.AbvGrd   1    0.0534  17.792 -2868.0
## <none>                         17.845 -2867.7
## + Garage.Yr.Blt   1    0.0390  17.806 -2867.4
## + Full.Bath       1    0.0307  17.815 -2867.1
## + Bsmt.Qual       5    0.2002  17.645 -2866.5
## + Exter.Qual      3    0.1092  17.736 -2866.5
## + Garage.Cars     1    0.0058  17.840 -2866.0
## + Mas.Vnr.Area    1    0.0002  17.845 -2865.7
## - Total.Bsmt.SF   1    0.1867  18.032 -2861.6
## - Kitchen.Qual    4    0.3517  18.197 -2860.5
## - Garage.Area     1    0.2632  18.109 -2858.3
## - Year.Remod.Add  1    0.7630  18.608 -2837.1
## - BsmtFin.SF.1    1    1.3154  19.161 -2814.3
## - Neighborhood   25    4.2890  22.134 -2749.9
## - Overall.Qual    1    3.6090  21.454 -2726.2
## - area            1    4.5189  22.364 -2693.9
##
## Step:  AIC=-2875.6
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual +
##     Total.Bsmt.SF + Fireplaces
##
##                  Df Sum of Sq     RSS     AIC
## - X1st.Flr.SF     1    0.0014  17.622 -2877.5
## + TotRms.AbvGrd   1    0.0624  17.558 -2876.4
## + Year.Built      1    0.0616  17.559 -2876.3
## + Garage.Yr.Blt   1    0.0561  17.564 -2876.1
## <none>                         17.620 -2875.6
## + Exter.Qual      3    0.1118  17.509 -2874.6
## + Full.Bath       1    0.0183  17.602 -2874.4
## + Garage.Cars     1    0.0123  17.608 -2874.1
## + Bsmt.Qual       5    0.1897  17.431 -2874.0
## + Mas.Vnr.Area    1    0.0011  17.619 -2873.6
## - Kitchen.Qual    4    0.3317  17.952 -2869.1
## - Fireplaces      1    0.2250  17.845 -2867.7
## - Total.Bsmt.SF   1    0.2349  17.855 -2867.3
## - Garage.Area     1    0.2743  17.895 -2865.6
## - Year.Remod.Add  1    0.8565  18.477 -2840.6
## - BsmtFin.SF.1    1    1.0806  18.701 -2831.2
## - Neighborhood   25    4.0474  21.668 -2764.5
## - Overall.Qual    1    3.3501  20.971 -2742.0
## - area            1    3.6629  21.283 -2730.5
##
## Step:  AIC=-2877.53
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 +
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF +
##     Fireplaces
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    0.0630  17.559 -2878.3
## + TotRms.AbvGrd   1    0.0621  17.560 -2878.3
## + Garage.Yr.Blt   1    0.0573  17.564 -2878.1
## <none>                         17.622 -2877.5
## + Exter.Qual      3    0.1102  17.512 -2876.4
## + Full.Bath       1    0.0185  17.603 -2876.3
## + Garage.Cars     1    0.0122  17.610 -2876.1
## + Bsmt.Qual       5    0.1821  17.440 -2875.6
## + X1st.Flr.SF     1    0.0014  17.620 -2875.6
## + Mas.Vnr.Area    1    0.0012  17.621 -2875.6
## - Kitchen.Qual    4    0.3359  17.958 -2870.8
## - Fireplaces      1    0.2247  17.846 -2869.7
## - Garage.Area     1    0.2746  17.896 -2867.5
## - Total.Bsmt.SF   1    0.3847  18.006 -2862.7
## - Year.Remod.Add  1    0.8569  18.479 -2842.5
## - BsmtFin.SF.1    1    1.0792  18.701 -2833.2
## - Neighborhood   25    4.0737  21.695 -2765.5
## - Overall.Qual    1    3.3611  20.983 -2743.5
## - area            1    3.9371  21.559 -2722.4
##
## Step:  AIC=-2878.32
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 +
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF +
##     Fireplaces + Year.Built
##
##                  Df Sum of Sq     RSS     AIC
## + TotRms.AbvGrd   1    0.0669  17.492 -2879.3
## <none>                         17.559 -2878.3
## + Full.Bath       1    0.0341  17.525 -2877.8
## + Exter.Qual      3    0.1221  17.437 -2877.8
## - Year.Built      1    0.0630  17.622 -2877.5
## + Garage.Yr.Blt   1    0.0152  17.544 -2877.0
## + Garage.Cars     1    0.0071  17.552 -2876.6
## + Mas.Vnr.Area    1    0.0028  17.556 -2876.4
## + X1st.Flr.SF     1    0.0001  17.559 -2876.3
## + Bsmt.Qual       5    0.1670  17.392 -2875.8
## - Kitchen.Qual    4    0.2736  17.832 -2874.3
## - Fireplaces      1    0.2225  17.781 -2870.5
## - Garage.Area     1    0.2446  17.803 -2869.6
## - Total.Bsmt.SF   1    0.3585  17.917 -2864.6
## - Year.Remod.Add  1    0.7915  18.350 -2846.0
## - BsmtFin.SF.1    1    1.0276  18.586 -2836.0
## - Neighborhood   25    3.5256  21.084 -2785.8
## - Overall.Qual    1    3.1498  20.709 -2751.8
## - area            1    3.9755  21.534 -2721.3
##
## Step:  AIC=-2879.3
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 +
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF +
##     Fireplaces + Year.Built + TotRms.AbvGrd
##
##                  Df Sum of Sq     RSS     AIC
## + Full.Bath       1    0.0461  17.446 -2879.3
## <none>                         17.492 -2879.3
## + Exter.Qual      3    0.1133  17.378 -2878.4
## - TotRms.AbvGrd   1    0.0669  17.559 -2878.3
## - Year.Built      1    0.0678  17.560 -2878.3
## + Garage.Yr.Blt   1    0.0156  17.476 -2878.0
## + Garage.Cars     1    0.0031  17.489 -2877.4
## + Mas.Vnr.Area    1    0.0031  17.489 -2877.4
## + X1st.Flr.SF     1    0.0001  17.492 -2877.3
## + Bsmt.Qual       5    0.1772  17.315 -2877.2
## - Kitchen.Qual    4    0.2861  17.778 -2874.7
## - Fireplaces      1    0.2314  17.723 -2871.1
## - Garage.Area     1    0.2474  17.739 -2870.4
## - Total.Bsmt.SF   1    0.3709  17.863 -2864.9
## - Year.Remod.Add  1    0.8070  18.299 -2846.2
## - BsmtFin.SF.1    1    1.0744  18.566 -2834.9
## - area            1    1.4154  18.907 -2820.7
## - Neighborhood   25    3.3948  20.887 -2791.1
## - Overall.Qual    1    3.1836  20.675 -2751.0
##
## Step:  AIC=-2879.35
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 +
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF +
##     Fireplaces + Year.Built + TotRms.AbvGrd + Full.Bath
##
##                  Df Sum of Sq     RSS     AIC
## <none>                         17.446 -2879.3
## - Full.Bath       1    0.0461  17.492 -2879.3
## + Exter.Qual      3    0.1212  17.325 -2878.8
## + Garage.Yr.Blt   1    0.0170  17.429 -2878.1
## - TotRms.AbvGrd   1    0.0789  17.525 -2877.8
## + Garage.Cars     1    0.0044  17.441 -2877.6
## + Mas.Vnr.Area    1    0.0029  17.443 -2877.5
## - Year.Built      1    0.0876  17.533 -2877.4
## + X1st.Flr.SF     1    0.0000  17.446 -2877.4
## + Bsmt.Qual       5    0.1535  17.292 -2876.2
## - Kitchen.Qual    4    0.2914  17.737 -2874.4
## - Fireplaces      1    0.2130  17.659 -2871.9
## - Garage.Area     1    0.2417  17.688 -2870.6
## - Total.Bsmt.SF   1    0.3798  17.826 -2864.6
## - Year.Remod.Add  1    0.8226  18.268 -2845.5
## - BsmtFin.SF.1    1    1.0578  18.503 -2835.5
## - area            1    1.4426  18.888 -2819.5
## - Neighborhood   25    3.3382  20.784 -2793.0
## - Overall.Qual    1    3.1656  20.611 -2751.5

summary(aic2)


##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + area +
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual +
##     Total.Bsmt.SF + Fireplaces + Year.Built + TotRms.AbvGrd +
##     Full.Bath, data = ames_train.2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.81097 -0.05623  0.00748  0.07791  0.43378
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         1.204e+01  6.674e-02 180.363  < 2e-16 ***
## Overall.Qual        8.772e-02  7.575e-03  11.580  < 2e-16 ***
## NeighborhoodBlueste 1.141e-01  1.085e-01   1.052  0.29326
## NeighborhoodBrDale  2.156e-01  8.012e-02   2.691  0.00728 **
## NeighborhoodBrkSide 2.960e-02  7.253e-02   0.408  0.68338
## NeighborhoodClearCr 1.974e-01  9.952e-02   1.983  0.04768 *
## NeighborhoodCollgCr 4.707e-02  6.218e-02   0.757  0.44929
## NeighborhoodCrawfor 1.814e-01  7.303e-02   2.484  0.01323 *
## NeighborhoodEdwards 9.341e-03  6.725e-02   0.139  0.88957
## NeighborhoodGilbert 6.910e-02  6.451e-02   1.071  0.28449
## NeighborhoodGreens  1.393e-02  1.097e-01   0.127  0.89893
## NeighborhoodIDOTRR  1.461e-01  7.514e-02   1.944  0.05230 .
## NeighborhoodMeadowV 1.814e-01  8.195e-02   2.213  0.02718 *
## NeighborhoodMitchel 7.073e-02  6.750e-02   1.048  0.29507
## NeighborhoodNAmes   5.172e-02  6.539e-02   0.791  0.42918
## NeighborhoodNoRidge 8.770e-02  7.161e-02   1.225  0.22111
## NeighborhoodNPkVill 5.850e-02  9.858e-02   0.593  0.55309
## NeighborhoodNridgHt 1.173e-01  6.383e-02   1.838  0.06650 .
## NeighborhoodNWAmes  3.707e-03  6.812e-02   0.054  0.95662
## NeighborhoodOldTown 8.488e-02  7.243e-02   1.172  0.24162
## NeighborhoodSawyer  6.458e-02  6.785e-02   0.952  0.34150
## NeighborhoodSawyerW 6.020e-03  6.496e-02   0.093  0.92619
## NeighborhoodSomerst 8.649e-02  6.248e-02   1.384  0.16667
## NeighborhoodStoneBr 1.895e-01  7.065e-02   2.683  0.00746 **
## NeighborhoodSWISU   1.760e-02  8.613e-02   0.204  0.83816
## NeighborhoodTimber  1.193e-01  7.088e-02   1.683  0.09286 .
## NeighborhoodVeenker 1.201e-01  8.220e-02   1.461  0.14448
## area                2.076e-04  2.656e-05   7.817 1.86e-14 ***
## BsmtFin.SF.1        1.062e-04  1.586e-05   6.694 4.30e-11 ***
## Year.Remod.Add      2.470e-03  4.184e-04   5.903 5.44e-09 ***
## Garage.Area         1.278e-04  3.995e-05   3.200  0.00143 **
## Kitchen.QualFa      7.685e-02  5.472e-02   1.404  0.16061
## Kitchen.QualGd      1.105e-02  2.742e-02   0.403  0.68718
## Kitchen.QualPo      2.264e-01  1.774e-01   1.276  0.20229
## Kitchen.QualTA      7.025e-02  3.185e-02   2.206  0.02772 *
## Total.Bsmt.SF       8.056e-05  2.008e-05   4.011 6.65e-05 ***
## Fireplaces          3.404e-02  1.133e-02   3.004  0.00276 **
## Year.Built          9.136e-04  4.743e-04   1.926  0.05445 .
## TotRms.AbvGrd       1.198e-02  6.549e-03   1.828  0.06788 .
## Full.Bath           2.416e-02  1.728e-02   1.397  0.16268
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1536 on 739 degrees of freedom
## Multiple R-squared:  0.8766, Adjusted R-squared:  0.87
## F-statistic: 134.5 on 39 and 739 DF,  p-value: < 2.2e-16

2.3.2 Section 3.2 Transformation

I limited this step to visualizing the predictor variables to judge whether any of them should be log or
exponentially transformed. Although some predictors became approximately normal after careful analysis,
none of the transformations improved the final fit of the model, so I decided not to include variable
transformations, given that a lot of work was necessary to evaluate each predictor individually.

2.3.3 Section 3.3 Variable Interaction
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

I could not think of any meaningful variable interaction in this dataset and did not try to fit all possible
interactions, given that doing so could lead to false positives when testing whether each coefficient is significant.
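
A single candidate interaction can be checked without fitting every possible pair by comparing two nested models. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for ames_train, and the wt-by-hp interaction is a hypothetical choice, not one examined in this project.

```r
# Illustration only: mtcars stands in for the Ames training data.
base_fit <- lm(log(mpg) ~ wt + hp, data = mtcars)  # main effects only
int_fit  <- lm(log(mpg) ~ wt * hp, data = mtcars)  # adds the wt:hp interaction

# A lower AIC for int_fit would favour keeping the interaction.
AIC(base_fit, int_fit)

# Partial F-test of the single added interaction term.
anova(base_fit, int_fit)
```

Testing one pre-specified interaction this way keeps the number of hypothesis tests small, which is the concern raised above about fitting all possible interactions.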

2.3.4 Section 3.4 Variable Selection
What method did you use to select the variables you included? Why did you select the method you used?
Explain in a few sentences.

Among the variable selection techniques we have seen in the course, I decided to use the AIC (Akaike
Information Criterion) instead of the BIC (Bayesian Information Criterion) or stepwise selection using
p-values. AIC is an easier metric to understand, it is less strict than BIC (it penalizes additional
parameters less heavily), and it does not involve hypothesis tests, which in stepwise selection with
p-values could lead me to false results.

The procedure for the final model selection can be seen in the first block of the Final model section.
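
The difference between the two criteria can be sketched with the `k` argument of `step()`: `k = 2` gives the AIC penalty and `k = log(n)` gives the BIC penalty. The example below is an illustration under assumptions, not the selection run reported in this project: it uses the built-in mtcars data and an arbitrary set of candidate predictors.

```r
# Illustration only: mtcars stands in for ames_train.2.
null_fit <- lm(log(mpg) ~ 1, data = mtcars)
search_scope <- list(lower = ~ 1, upper = ~ wt + hp + disp + qsec + drat)

# k = 2 is the AIC penalty (the default of step()).
aic_fit <- step(null_fit, scope = search_scope,
                direction = "both", k = 2, trace = 0)

# k = log(n) is the BIC penalty, which is larger than 2 whenever n > 7.
bic_fit <- step(null_fit, scope = search_scope,
                direction = "both", k = log(nrow(mtcars)), trace = 0)

formula(aic_fit)
formula(bic_fit)
```

Because the BIC penalty per parameter grows with the sample size, it tends to select smaller models than AIC on the same data.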

2.3.5 Section 3.5 Model Testing

Testing on out-of-sample data confirmed that I was not overfitting my model. I did not present all the
models I fit in this assignment, but each time I added a few numerical or categorical attributes I checked
the test data to see whether the error improved (and did not rise).

2.4 Part 4 Final Model Assessment
2.4.1 Section 4.1 Final Model Residual
For your final model, create and briefly interpret an informative plot of the residuals.


To get the predictions for the test set, we must first apply the same operations we applied to the training
set: rescaling the numerical attributes and processing the factor variables.

The residuals from the test set also appear normally distributed around 0, and the Q-Q plot supports
the conclusion that the residuals do not deviate much from normality.

# training data RMSE
testModRel2_RMSE <- sqrt(mean((ames_train.2$price - exp(aic2$fitted.values))^2))

### Test Data Preprocessing

load("ames_test.Rdata")
ames_test.2 <- ames_test

ames_test.2[, integer.cols.index] <- scale(ames_test[, integer.cols.index],
                                           center = TRUE, scale = FALSE)
ames_test.2$price <- ames_test$price

# check if any of these columns have NA in the test set
Test.cols.interest <- c('Fireplaces', 'BsmtFin.SF.1',
                        'Full.Bath', 'TotRms.AbvGrd',
                        'Year.Remod.Add', 'Year.Built',
                        'Total.Bsmt.SF', 'Garage.Area',
                        'area', 'Overall.Qual', 'Neighborhood',
                        'Kitchen.Qual')

Test.na.rows <- which(apply(ames_test.2[, Test.cols.interest], 1, function(row) {
  any(is.na(row))
}))
# ames_test.2 <- ames_test.2[-Test.na.rows, ]

ames_test.2.Qual_sub <- as.character(ames_test.2$Bsmt.Qual)
ames_test.2.Qual_sub[is.na(ames_test.2.Qual_sub)] <- 'NA'
ames_test.2$Bsmt.Qual <- relevel(factor(ames_test.2.Qual_sub), ref = 'TA')
ames_test.2$Neighborhood <- droplevels(ames_test.2$Neighborhood)
ames_test.2$Neighborhood <- factor(ames_test.2$Neighborhood,
                                   levels = c(levels(ames_test.2$Neighborhood), 'GrnHill_LandMrk'))

# predictions are on the log scale; the residual here is predicted minus observed
testModRel2_predictions <- predict(aic2, ames_test.2)
residualsq10 <- testModRel2_predictions - log(ames_test.2$price)

par(mfrow = c(1, 2))
plot(residualsq10, main = 'Final Model Residuals for validation data')
qqnorm(residualsq10)
qqline(residualsq10)

par(mfrow = c(1, 1))
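
The code above centers the test columns on the test set's own means. One alternative, sketched below under stated assumptions, is to reuse the means that were subtracted from the training set, so that both sets are shifted by exactly the same amount. The data frames `train` and `test` and the vector `num_cols` are hypothetical stand-ins for `ames_train.2`, `ames_test.2` and `integer.cols.index`; this is not the procedure used for the results reported here.

```r
# Hypothetical toy data standing in for the Ames training and test sets.
train <- data.frame(area = c(1000, 1500, 2000), price = c(100, 150, 210))
test  <- data.frame(area = c(1200, 1800),       price = c(120, 185))

num_cols <- "area"  # plays the role of integer.cols.index

# Center the training columns and keep the means that scale() subtracted.
train_scaled <- scale(train[, num_cols, drop = FALSE], center = TRUE, scale = FALSE)
train_means  <- attr(train_scaled, "scaled:center")
train[num_cols] <- as.data.frame(train_scaled)

# Subtract the training means (not the test means) from the test columns.
test[num_cols] <- as.data.frame(
  scale(test[, num_cols, drop = FALSE], center = train_means, scale = FALSE)
)

test$area
```

With the training mean of 1500 in this toy example, the two test areas become -300 and 300.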

2.4.2 Section 4.2 Final Model RMSE
For your final model, calculate and briefly comment on the RMSE.

RMSEq11 <- sqrt(mean((ames_test.2$price - exp(testModRel2_predictions))^2))
RMSEq11

## [1] 46660.19

The RMSE for the test set was, as expected, higher than for the training set. Compared with the RMSE
from the previous test set predictions, however, it improved considerably: the RMSE decreased from
58072.19 to 46660.19. Because the test set was used to evaluate out-of-sample performance while the model
was being built, this is also an optimistic estimate of the RMSE. For a better (more conservative) RMSE, we
must make a final prediction on a set we have never seen before.
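
Since the model is fitted to log(price), the predictions have to be exponentiated before the RMSE is computed in dollars. The sketch below wraps that calculation in a small helper; `rmse_dollars` and the simulated data frame `toy` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: `fit` is any lm() whose response is log(price),
# and `newdata` must contain a `price` column on the original scale.
rmse_dollars <- function(fit, newdata) {
  predicted <- exp(predict(fit, newdata = newdata))  # back to the price scale
  sqrt(mean((newdata$price - predicted)^2))
}

# Simulated stand-in for the Ames data.
set.seed(1)
toy <- data.frame(area = runif(200, 500, 3000))
toy$price <- exp(11 + 0.0004 * toy$area + rnorm(200, sd = 0.15))
train <- toy[1:150, ]
test  <- toy[151:200, ]

fit <- lm(log(price) ~ area, data = train)
rmse_dollars(fit, train)  # in-sample error
rmse_dollars(fit, test)   # out-of-sample error
```

Computing both values with the same function makes the training, test and validation errors directly comparable.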

2.4.3 Section 4.3 Final Model Evaluation
What are some strengths and weaknesses of your model?

The model developed so far is a simple one, without further investigation of the remaining features,
their interactions or polynomial terms. Despite this lack of new features, the model seemed to generalize
well, as could be seen in the out-of-sample testing.

With few variables, the model seems to give a good and understandable fit, in the sense that the parameters
can help the real estate employees decide whether houses are being sold under or over their expected price.

Some weaknesses:

Some categorical variables (Neighborhood, for example) have few examples for some of their levels,
with some neighborhoods having fewer than 10 examples in total. This could give us unrealistic
parameter estimates. A solution would be to get more examples, or to reduce the number of
categories and see if the model still gets a good fit.

I did not consider multicollinearity among the predictors. If there is collinearity between the regression
variables I used, we can have unstable coefficient estimates and, therefore, unstable predictions.

In future work I would address these two issues, as both can lead to unrealistic predictions for the house
prices.
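
One common way to check for multicollinearity is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below computes it in base R; `vif_manual` is a hypothetical helper, the built-in mtcars data stands in for the Ames training set, and it handles numeric predictors only.

```r
# Hypothetical helper: variance inflation factor for each numeric predictor.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # R^2 from regressing predictor p on all the other predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Illustration only: mtcars stands in for ames_train.2.
vif_manual(mtcars, c("wt", "hp", "disp", "qsec"))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate that its coefficient estimate is increasingly unstable. Applied to this project, natural candidates to inspect would be pairs of size-related predictors such as Garage.Area and Garage.Cars, although that check was not run here.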

2.4.4 Section 4.4 Final Model Validation
Testing your final model on a separate, validation data set is a great way to determine how your model will
perform in real-life practice.

You will use the ames_validation dataset to do some additional assessment of your final model. Discuss
your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the
house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

Before applying the model to the final dataset, we must perform the same preprocessing steps we applied to the
previous datasets, that is, centering the numerical variables and processing the factor variables.


# preprocess data
ames_validation.2 <- ames_validation
ames_validation.2[, integer.cols.index] <- scale(ames_validation[, integer.cols.index],
                                                 center = TRUE, scale = FALSE)
ames_validation.2$price <- ames_validation$price

Test.cols.interest <- c('Fireplaces', 'BsmtFin.SF.1',
                        'Full.Bath', 'TotRms.AbvGrd',
                        'Year.Remod.Add', 'Year.Built',
                        'Total.Bsmt.SF', 'Garage.Area',
                        'area', 'Overall.Qual', 'Neighborhood',
                        'Kitchen.Qual')

Test.na.rows <- which(apply(ames_validation.2[, Test.cols.interest], 1, function(row) {
  any(is.na(row))
}))
# ames_test.2 <- ames_test.2[-Test.na.rows, ]

ames_validation.2.Qual_sub <- as.character(ames_validation.2$Bsmt.Qual)
ames_validation.2.Qual_sub[is.na(ames_validation.2.Qual_sub)] <- 'NA'
ames_validation.2$Bsmt.Qual <- relevel(factor(ames_validation.2.Qual_sub), ref = 'TA')

# merge GrnHill and Landmrk into one level, then drop those rows
ames_validation.2.Neighborhood.TEMP <- as.character(ames_validation.2$Neighborhood)
ames_validation.2.Neighborhood.TEMP[ames_validation.2.Neighborhood.TEMP == 'GrnHill' |
                                      ames_validation.2.Neighborhood.TEMP == 'Landmrk'] <- 'GrnHill_Landmrk'
ames_validation.2$Neighborhood <- factor(ames_validation.2.Neighborhood.TEMP,
                                         levels = unique(ames_validation.2.Neighborhood.TEMP))
ames_validation.2 <- as.data.frame(ames_validation.2 %>%
                                     filter(!Neighborhood %in% c('GrnHill_Landmrk', 'Landmrk')))

Finally, the RMSE for the validation set is:

valModRel2_predictions <- predict(aic, ames_validation.2)

valModRel2_RMSEq6 <- sqrt(mean((ames_validation.2$price - exp(valModRel2_predictions))^2))
valModRel2_RMSEq6

## [1] 70301.83

As expected, the RMSE for the validation set (70301.83) is higher than for the test set (46660.19), because the
test set was used as part of the training process, which artificially reduced its RMSE. Next, in order to test
whether the assumptions about uncertainty are met, we calculate the coverage probability. If our model meets this
criterion, roughly 95% of the observations should fall inside the 95% prediction interval.

# Predict prices
predict.full <- exp(predict(aic, ames_validation.2, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_validation.2$price > predict.full[, "lwr"] &
                             ames_validation.2$price < predict.full[, "upr"])
coverage.prob.full

## [1] 0.95996


As we can see above, ~96% of the validation observations fall inside their 95% prediction intervals, showing
that the final model properly reflects uncertainty.

2.5 Part 5 Conclusion
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and
your model.

Although this study managed to fit a good model to the data, much more work can be done as next steps. New
variables could be engineered, and polynomial regression terms could introduce some non-linearity into the model.
Beyond that, we could impose some regularization to improve out-of-sample accuracy, which would also provide a
further variable selection procedure.

The final model showed that, with a few well-selected variables, we can already give a good explanation
of house prices in Ames. In this study, variables such as overall quality and year built explained a large
share of the variance in price and could already help real estate investors and employees see whether they are
charging (or being charged) above the expected price.

Also, it is important to note how to split your data so that the training set does not give you an optimistic
error rate, and how to make better use of the available data to understand how your model
will perform on never-before-seen data. Lastly, the model selection procedures helped to eliminate bad
predictors automatically and to control model complexity. Combined, these procedures
helped to build a final model with few variables but high predictive capacity.
