This data describes the survival status of 1309 of the 1324 individual passengers on the
Titanic.
Survived: Yes or No
Passenger Class: 1, 2, or 3, corresponding to 1st, 2nd, or 3rd class
Sex: Passenger sex
Age: Passenger age
Siblings and Spouses: The number of siblings and spouses aboard
Parents and Children: The number of parents and children aboard
Fare: The passenger fare
Port: Port of embarkment (C = Cherbourg; Q = Queenstown; S = Southampton)
a. How many splits are in your final tree (Hint: Go)? Please include visualization of your split
history, and short explanation on how SAS JMP arrived at this number of splits.
There are 6 splits in my final tree, based on 30% validation and 70% training. Figure 1 shows the
coefficient of determination for both training and validation data.
The Split History report (Figure 2) shows how the R Square value changes for
training and validation data after each split. The vertical line is drawn at the
number of splits used in the final model, which has 6 splits. However, after 6
splits, the validation R Square (the red line in Figure 2) starts to decrease. For
the validation set, which was not used to build the model, additional splits
are not improving our ability to predict the response.
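The idea of stopping where the validation R Square peaks can be sketched in a few lines of Python. This is only an illustration, not JMP's exact stopping rule: the function name best_split_count is hypothetical, and the R Square values are made-up numbers, not the ones plotted in Figure 2.

```python
def best_split_count(validation_r2):
    """Return the number of splits with the highest validation R Square.

    validation_r2[k] is the validation R Square after k splits
    (k = 0 is the unsplit root). Ties go to the smaller tree.
    """
    best_k = 0
    for k, r2 in enumerate(validation_r2):
        if r2 > validation_r2[best_k]:
            best_k = k
    return best_k


# Hypothetical split history, NOT read from Figure 2.
example_r2 = [0.00, 0.21, 0.27, 0.30, 0.31, 0.32, 0.33, 0.32, 0.31]
print(best_split_count(example_r2))
```

With these made-up values the validation R Square is highest after 6 splits and falls afterward, which is the pattern described for Figure 2.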
Taskin Shakib
b. Which variables are the largest contributors? Expected responses are somewhat subjective,
but assess the G^2 Entropy (or Information Gain) scores, and make an analyst decision on
whether you view a sharp drop off to finalize your answer.
The largest contributor in my model is Sex (as shown in Figure 3). Since we know that the rule is
to split with the highest LogWorth, the model started splitting with Sex (LogWorth 56.584).
The G^2 value for the variable Sex is also the highest (254.58). The second largest contributor in
our model was Passenger Class, with G^2 value of 72.54 and LogWorth of 16.79.
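To make the two scores more concrete, here is a small Python sketch of how a G^2 value can be computed from class counts and how a LogWorth relates to a p-value (LogWorth = -log10(p-value)). The function names and the counts are assumptions for illustration only. The sketch does not reproduce the adjusted p-values JMP uses, so it will not reproduce the numbers in Figure 3.

```python
import math


def node_g2(counts):
    """G^2 of one node: 2 * sum(n_k * ln(n / n_k)) over non-empty classes."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(n / c) for c in counts if c > 0)


def split_g2(left, right):
    """Reduction in G^2 from splitting a parent into left and right nodes.

    left and right are class counts in the same order, e.g. [Yes, No].
    """
    parent = [l + r for l, r in zip(left, right)]
    return node_g2(parent) - node_g2(left) - node_g2(right)


def logworth(p_value):
    """LogWorth is the negative base-10 logarithm of a p-value."""
    return -math.log10(p_value)


# Hypothetical [Yes, No] counts, NOT taken from the Titanic data table.
print(split_g2([80, 20], [30, 70]))
print(logworth(0.01))
```

A split that separates the classes well leaves purer child nodes, so the G^2 reduction is large. A split whose children have the same class mix as the parent gives a reduction of zero.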
c. What is the validation misclassification rate for this model? Is the model better at predicting
survival or non-survival?
The misclassification rate for our validation data is 0.1937, or 19.37%. The numbers
behind the misclassification rate can be seen in the confusion matrix (Figure 4). We focus
on the misclassification rate and confusion matrix for the validation data. Since these data
were not used in building the model, this provides a better indication of how well the
model classifies Survived.
There are four possible outcomes in our classification:
A passenger who survived is correctly classified as survived.
A passenger who survived is misclassified as did not survive.
A passenger who did not survive is misclassified as survived.
A passenger who did not survive is correctly classified as did not survive.
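The four outcomes above map directly onto the cells of a confusion matrix. The following Python sketch counts them and derives the misclassification rate. The function names are hypothetical and the labels are a toy example, not the counts from Figure 4.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count the four outcomes: (true pos, false neg, false pos, true neg)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # survived, classified as survived
        elif a == positive:
            fn += 1  # survived, classified as did not survive
        elif p == positive:
            fp += 1  # did not survive, classified as survived
        else:
            tn += 1  # did not survive, classified as did not survive
    return tp, fn, fp, tn


def misclassification_rate(tp, fn, fp, tn):
    """Share of all rows that fall in the two misclassified cells."""
    return (fn + fp) / (tp + fn + fp + tn)


# Toy labels, NOT the validation rows behind Figure 4.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(misclassification_rate(tp, fn, fp, tn))
print(tp / (tp + fn), tn / (tn + fp))  # accuracy within Yes rows, within No rows
```

Comparing the two per-class accuracies on the last line is one way to judge whether a model is better at predicting survival or non-survival.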
d. What is the area under the validation ROC curve for Survived? Interpret this value.
Does the model do a better job of classifying survival than a random model?
The area under the curve, or AUC (labeled Area in Figure 5), is a measure of how well
our model sorts the data. The area under the curve for Survived = Yes is 0.8591 (see
Figure 5). A random sorting model has an AUC of 0.5, so the model predicts better than
the random sorting model.
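One way to see why AUC measures sorting is to compute it as the share of (survivor, non-survivor) pairs in which the survivor receives the higher predicted probability. The Python sketch below does this on a toy example. The function name and the scores are assumptions for illustration and are not the values behind Figure 5.

```python
def auc(actual, scores, positive="Yes"):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities of Survived = Yes, NOT from the fitted tree.
actual = ["Yes", "No", "Yes", "No"]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(actual, scores))
```

A model that sorts every survivor above every non-survivor scores 1.0, and a model that gives every row the same probability scores 0.5.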
e. What is the lift for the model at portion = 0.3 and at portion = 0.5? Interpret these values.
The higher the lift at a given portion, the better our model is at correctly classifying the outcome
within this portion.
For Survived = Yes, the lift at Portion = 0.3 is roughly 1.475 (see Figure 6). This means
that in rows in the data table that correspond to the top 30% of the model's predicted
probabilities, the number of actual Yes outcomes is 1.475 times higher than we would
expect if we had just chosen 30% of the rows from the data set at random.
For Survived = Yes, the lift at Portion = 0.4 is roughly 1.4 (see Figure 6). This means that
in rows in the data table that correspond to the top 40% of the model's predicted
probabilities, the number of actual Yes outcomes is 1.4 times higher than we would
expect if we had just chosen 40% of the rows from the data set at random.
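The lift at a given portion can be sketched in Python as the Yes rate among the top-ranked rows divided by the overall Yes rate. The function name lift_at and the data are hypothetical, so the results below are not the values read from Figure 6.

```python
def lift_at(actual, scores, portion, positive="Yes"):
    """Lift = Yes rate in the top `portion` of rows / overall Yes rate."""
    n = len(actual)
    k = max(1, round(portion * n))  # number of rows in the top portion
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(1 for _, a in ranked[:k] if a == positive) / k
    base_rate = sum(1 for a in actual if a == positive) / n
    return top_rate / base_rate


# Toy rows already ordered by predicted probability, NOT the Titanic data.
actual = ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
print(lift_at(actual, scores, 0.3))
print(lift_at(actual, scores, 0.5))
```

At portion = 1.0 every row is included, so the lift is always 1. Values above 1 at smaller portions mean the model concentrates the Yes outcomes near the top of its ranking.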