
Taskin Shakib

This data describes the survival status of 1309 of the 1324 individual passengers on the
Titanic.
Survived: Yes or No
Passenger Class: 1, 2, or 3 corresponding to 1st, 2nd, or 3rd class
Sex: Passenger sex
Age: Passenger age
Siblings and Spouses: The number of siblings and spouses aboard
Parents and Children: The number of parents and children aboard
Fare: The passenger fare
Port: Port of embarkment (C = Cherbourg; Q = Queenstown; S = Southampton)
a. How many splits are in your final tree (Hint: Go)? Please include visualization of your split
history, and short explanation on how SAS JMP arrived at this number of splits.
There are 6 splits in my final tree, based on 30% validation and 70% training. Figure 1 shows the
coefficient of determination for both training and validation data.

Figure 1: Splits in Final Tree

The Split History report (Figure 2) shows how the R Square value changes for
training and validation data after each split. The vertical line is drawn at the
number of splits used in the final model, which is 6. Beyond 6 splits, the
validation R Square (the red line in Figure 2) starts to decrease: for the
validation set, which was not used to build the model, additional splits
do not improve our ability to predict the response.
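
The stopping idea described above can be sketched in Python. This is a simplified illustration, not JMP's actual implementation: it assumes the rule is "stop at the first split count whose validation R Square is not beaten by any of the next 10 splits". The function name, the lookahead parameter, and the R Square values are all hypothetical and are not read from Figure 2.

```python
def choose_split_count(validation_r2, lookahead=10):
    """Pick the number of splits to keep, given validation R^2 per split count.

    validation_r2[k] is the validation R^2 after k splits (index 0 = no splits).
    Returns the first k whose R^2 is at least as good as every value in the
    next `lookahead` entries. The last index always qualifies.
    """
    if not validation_r2:
        raise ValueError("validation_r2 must not be empty")
    for k, r2 in enumerate(validation_r2):
        window = validation_r2[k + 1 : k + 1 + lookahead]
        if all(r2 >= later for later in window):
            return k


# Hypothetical validation R^2 values for 0..10 splits (made up for illustration).
history = [0.00, 0.21, 0.27, 0.30, 0.31, 0.32, 0.34, 0.33, 0.33, 0.32, 0.31]
print(choose_split_count(history))
```

With this made-up history the validation R Square peaks at index 6 and declines afterwards, which mirrors the shape described for Figure 2.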

Figure 2: Split History

b. Which variables are the largest contributors? Expected responses are somewhat subjective,
but assess the G^2 Entropy (or Information Gain) scores, and make an analyst decision on
whether you view a sharp drop off to finalize your answer.
The largest contributor in my model is Sex (as shown in Figure 3). Since the rule is to split
on the variable with the highest LogWorth, the model split first on Sex (LogWorth 56.584).
The G^2 value for Sex is also the highest (254.58). The second largest contributor in
our model was Passenger Class, with a G^2 value of 72.54 and a LogWorth of 16.79.
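
To make the link between G^2 and LogWorth concrete, here is a minimal Python sketch. It assumes G^2 is the likelihood-ratio chi-square of the branch-by-response count table, and that LogWorth is -log10 of the unadjusted chi-square p-value with 1 degree of freedom. JMP may adjust the p-value when a variable offers many possible split points, so this should be read as an approximation. The function names and the count table are hypothetical.

```python
import math


def g_squared(table):
    """Likelihood-ratio chi-square (G^2) for a table of counts.

    table[i][j] = number of rows in branch i with response level j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2


def logworth_1df(g2):
    """-log10 of the unadjusted chi-square p-value with 1 degree of freedom.

    For 1 degree of freedom, P(X > g2) = erfc(sqrt(g2 / 2)). Extremely large
    g2 values underflow to p = 0, in which case math.log10 raises ValueError.
    """
    p_value = math.erfc(math.sqrt(g2 / 2.0))
    return -math.log10(p_value)


# Hypothetical branch-by-response counts (rows: two branches; columns: Yes, No).
counts = [[60, 140], [150, 50]]
print(round(g_squared(counts), 2))

# G^2 values reported in the text above.
print(round(logworth_1df(254.58), 2))  # Sex
print(round(logworth_1df(72.54), 2))   # Passenger Class
```

Under these assumptions, the reported G^2 values of 254.58 and 72.54 give LogWorth values of roughly 56.58 and 16.79, which agrees with the figures quoted from the report.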

Figure 3: Survived, First Candidate Split

c. What is the validation misclassification rate for this model? Is the model better at predicting
survival or nonsurvival?

The misclassification rate for our validation data is 0.1937, or just under 20%. The numbers
behind the misclassification rate can be seen in the confusion matrix (Figure 4). We focus
on the misclassification rate and confusion matrix for the validation data. Since these data
were not used in building the model, this provides a better indication of how well the
model classifies Survived.

Figure 4: Misclassification Rate and Confusion Matrix

There are four possible outcomes in our classification:

A survived passenger is correctly classified as survived.
A survived passenger is misclassified as did not survive.
A passenger who did not survive is misclassified as survived.
A passenger who did not survive is correctly classified as did not survive.
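
The four outcomes above map onto the cells of a confusion matrix, and the rates follow from simple arithmetic. The Python sketch below uses hypothetical counts, not the values in Figure 4. The function and key names are illustrative. Comparing the two per-class rates is one way to judge whether a model is better at predicting survival or non-survival.

```python
def classification_rates(tp, fn, fp, tn):
    """Summarize a 2x2 confusion matrix with Survived = Yes as the positive class.

    tp: survived, classified as survived
    fn: survived, classified as did not survive
    fp: did not survive, classified as survived
    tn: did not survive, classified as did not survive
    """
    total = tp + fn + fp + tn
    return {
        "misclassification_rate": (fn + fp) / total,
        "survivor_accuracy": tp / (tp + fn),      # share of survivors classified correctly
        "non_survivor_accuracy": tn / (tn + fp),  # share of non-survivors classified correctly
    }


# Hypothetical validation counts, chosen only for illustration.
rates = classification_rates(tp=90, fn=60, fp=20, tn=230)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the model classifies non-survivors correctly far more often than survivors, even though the overall misclassification rate is 0.2.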

d. What is the area under the validation ROC curve for Survived? Interpret this value.
Does the model do a better job of classifying survival than a random model?

The area under the curve, or AUC (labeled Area in Figure 5), is a measure of how well
our model sorts the data. The area under the curve for Survived = Yes is 0.8591 (see
Figure 5), indicating that the model predicts better than the random sorting model.
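
One way to read the AUC as a sorting measure: it equals the probability that a randomly chosen Yes row gets a higher predicted probability than a randomly chosen No row, with ties counted as one half. A random sorting model scores about 0.5 on this measure. The Python sketch below computes it by brute force on a few hypothetical rows. The data and function name are illustrative and are not taken from the report.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of (Yes, No) pairs that the scores rank correctly.

    scores: predicted probabilities of Survived = Yes
    labels: 1 for actual Yes, 0 for actual No
    """
    yes_scores = [s for s, y in zip(scores, labels) if y == 1]
    no_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not yes_scores or not no_scores:
        raise ValueError("need at least one Yes row and one No row")
    wins = 0.0
    for p in yes_scores:
        for n in no_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(yes_scores) * len(no_scores))


# Hypothetical predicted probabilities and outcomes.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc_by_pairs(scores, labels), 4))
```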

Figure 5: ROC Curve for Survived

e. What is the lift for the model at portion = 0.3 and at portion = 0.5? Interpret these values.
The higher the lift at a given portion, the better our model is at correctly classifying the outcome
within this portion.

For Survived = Yes, the lift at Portion = 0.3 is roughly 1.475 (see Figure 6). This means
that in rows in the data table that correspond to the top 30% of the model's predicted
probabilities, the number of actual Yes outcomes is 1.475 times higher than we would
expect if we had just chosen 30% of the rows from the data set at random.
For Survived = Yes, the lift at Portion = 0.4 is roughly 1.4 (see Figure 6). This means that
in rows in the data table that correspond to the top 40% of the model's predicted
probabilities, the number of actual Yes outcomes is 1.4 times higher than we would
expect if we had just chosen 40% of the rows from the data set at random.
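
The lift calculation described above can be sketched in Python as follows: sort rows by predicted probability, take the top portion, and divide the Yes rate in that portion by the overall Yes rate. The rows and the function name are hypothetical, so the resulting lift values are not those in Figure 6.

```python
def lift_at_portion(probabilities, actual_yes, portion):
    """Lift for Survived = Yes among the top `portion` of rows by predicted probability.

    probabilities: predicted probability of Yes for each row
    actual_yes: 1 if the row is an actual Yes, else 0
    portion: fraction of rows to keep, e.g. 0.3 for the top 30%
    """
    n = len(probabilities)
    top_n = max(1, round(portion * n))  # keep at least one row
    ranked = sorted(zip(probabilities, actual_yes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    overall_rate = sum(actual_yes) / n
    return top_rate / overall_rate


# Hypothetical rows: 4 of 10 passengers survived.
probs = [0.95, 0.90, 0.85, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(lift_at_portion(probs, actual, 0.3))
print(lift_at_portion(probs, actual, 0.5))
```

A lift of 1 would mean the top portion contains no more Yes outcomes than a random sample of the same size.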

Figure 6: Lift Curve for Survived

f. Summarize the three purest segments with respect to survival
as if I were your manager at work, and I wanted a business
language summary of the top three groups that the Decision Tree
identified. (Hint: Leaf Report can be helpful here by reinterpreting
the English rules that the Decision Tree generates into business
language)
The three purest segments with respect to survival are defined by sex,
passenger class, and age. In Figure 7, the highest predicted probability of
survival (0.9238), shown in the 2nd row of the leaf report, comes from a
leaf reached by three splits: first on sex (male), then on age, and last on
the number of siblings and spouses accompanying the passenger (fewer
than 4).
Here is the interpretation of this leaf, or decision rule: when the sex of a
passenger is male, age is less than 11, and siblings and spouses is less than
4, the predicted probability that Survived = Yes is 0.9238 (and the
probability that Survived = No is 0.0762).
In business terms, this means that a male passenger younger than 11 who
was travelling with fewer than 4 siblings or spouses had a 92.38% chance of
surviving the sinking of the Titanic. A female passenger in 1st or 2nd class
had a 92% chance of surviving. So for male passengers, age and the number
of accompanying family members were the most important deciding
factors. For female passengers, on the other hand, the most important
deciding factor was the class of ticket she held.

Figure 7: Leaf Report
