
Data Mining with Weka

Class 2 Lesson 1
Be a classifier!

Ian H. Witten
Department of Computer Science
University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 2.1: Be a classifier!

Class 1: Getting started with Weka
Class 2: Evaluation
Class 3: Simple classifiers
Class 4: More classifiers
Class 5: Putting it all together

Lesson 2.1: Be a classifier!
Lesson 2.2: Training and testing
Lesson 2.3: More training/testing
Lesson 2.4: Baseline accuracy
Lesson 2.5: Cross-validation
Lesson 2.6: Cross-validation results
Lesson 2.1: Be a classifier!

Interactive decision tree construction

Load segment-challenge.arff; look at dataset
Select UserClassifier (tree classifier)
Use the test set segment-test.arff
Examine data visualizer and tree visualizer
Plot region-centroid-row vs intensity-mean
Rectangle, Polygon and Polyline selection tools
... several selections ...
Right-click in Tree visualizer and Accept the tree
Over to you: how well can you do?
Lesson 2.1: Be a classifier!

Build a tree: what strategy did you use?

Given enough time, you could produce a "perfect" tree for the dataset
... but would it perform well on the test data?

Course text
Section 11.2: Do it yourself: the UserClassifier
Data Mining with Weka

Class 2 Lesson 2
Training and testing

Lesson 2.2: Training and testing

[Diagram: training data feeds an ML algorithm, which produces a classifier; the classifier is evaluated on test data to give evaluation results, then deployed]
Basic assumption: training and test sets are produced by independent sampling from an infinite population.
Lesson 2.2: Training and testing

Use J48 to analyze the segment dataset

Open file segment-challenge.arff
Choose J48 decision tree learner (trees > J48)
Supplied test set: segment-test.arff
Run it: 96% accuracy
Evaluate on training set: 99% accuracy
Evaluate on percentage split: 95% accuracy
Do it again: get exactly the same result!
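The "exactly the same result" behaviour comes from the percentage split shuffling the data with a fixed random-number seed. A minimal Python sketch of the idea (not Weka's actual code; the function name and the 66% default are illustrative):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle with a fixed seed, then split into train/test.
    A fixed seed makes the 'random' split reproducible, which is
    why repeated runs report exactly the same accuracy."""
    rng = random.Random(seed)      # deterministic generator
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train1, test1 = percentage_split(data)
train2, test2 = percentage_split(data)
print(train1 == train2 and test1 == test2)  # True: same seed, same split
```

Changing the seed changes the split, which is exactly what the next lesson exploits to measure variation.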
Lesson 2.2: Training and testing

Basic assumption:
training and test sets sampled independently from an infinite population
Just one dataset? Hold some out for testing
Expect slight variation in results
... but Weka produces the same results each time
e.g. J48 on segment-challenge dataset

Course text
Section 5.1: Training and testing
Data Mining with Weka

Class 2 Lesson 3
Repeated training and testing

Lesson 2.3: Repeated training and testing

Evaluate J48 on segment-challenge

With segment-challenge.arff and J48 (trees > J48)
Set percentage split to 90%
Run it: 96.7% accuracy
Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 [More options]

Accuracies: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Lesson 2.3: Repeated training and testing

Evaluate J48 on segment-challenge

Sample mean: x̄ = (Σ xi) / n
Variance: σ² = Σ (xi − x̄)² / (n − 1)
Standard deviation: σ

Accuracies: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

x̄ = 0.949, σ = 0.018
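These statistics are easy to check outside Weka. A short Python sketch over the ten accuracies above (using the sample formulas with divisor n − 1; the slide's σ = 0.018 differs slightly because the printed accuracies are rounded to three digits):

```python
import statistics

# Accuracies from ten 90% percentage-split runs with seeds 1-10
accs = [0.967, 0.940, 0.940, 0.967, 0.953,
        0.967, 0.920, 0.947, 0.933, 0.947]

mean = statistics.mean(accs)   # sample mean  x̄ = (Σ xi) / n
sd = statistics.stdev(accs)    # sample standard deviation, divisor n − 1

print(f"mean = {mean:.3f}, sd = {sd:.3f}")  # mean = 0.948, sd = 0.016
```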
Lesson 2.3: Repeated training and testing

Basic assumption:
training and test sets sampled independently from an infinite population
Expect slight variation in results
... get it by setting the random-number seed
Can calculate mean and standard deviation experimentally
Data Mining with Weka

Class 2 Lesson 4
Baseline accuracy

Lesson 2.4: Baseline accuracy

Use diabetes dataset and default holdout

Open file diabetes.arff
Test option: Percentage split
Try these classifiers:
  trees > J48: 76%
  bayes > NaiveBayes: 77%
  lazy > IBk: 73%
  rules > PART: 74%
(we'll learn about them later)
768 instances (500 negative, 268 positive)
Always guess "negative": 500/768, about 65%
rules > ZeroR: predicts the most likely class!
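ZeroR involves no real learning at all: it just counts class values and always predicts the majority. A hypothetical Python sketch of that baseline on the diabetes class distribution (the function name is illustrative, not Weka's API):

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: always predict the most frequent class.
    Returns that class and the accuracy of always guessing it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# diabetes.arff: 500 negative, 268 positive instances
labels = ["negative"] * 500 + ["positive"] * 268
pred, acc = zero_r(labels)
print(pred, f"{acc:.1%}")  # negative 65.1%
```

Any classifier that cannot beat this number has learned nothing useful about the attributes.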
Lesson 2.4: Baseline accuracy

Sometimes baseline is best!
Open supermarket.arff and blindly apply
  rules > ZeroR: 64%
  trees > J48: 63%
  bayes > NaiveBayes: 63%
  lazy > IBk: 38% (!!)
  rules > PART: 63%
Attributes are not informative
Don't just apply Weka to a dataset:
you need to understand what's going on!
Lesson 2.4: Baseline accuracy

Consider whether differences are likely to be significant
Always try a simple baseline, e.g. rules > ZeroR
Look at the dataset
Don't blindly apply Weka: try to understand what's going on!
Data Mining with Weka

Class 2 Lesson 5
Cross-validation

Lesson 2.5: Cross-validation

Can we improve upon repeated holdout? (i.e. reduce variance)
Cross-validation
Stratified cross-validation
Lesson 2.5: Cross-validation

Repeated holdout
(in Lesson 2.3: hold out 10% for testing, repeat 10 times)

[Diagram: the dataset split into 90% training / 10% testing, repeated 10 times]
Lesson 2.5: Cross-validation

10-fold cross-validation

Divide dataset into 10 parts (folds)
Hold out each part in turn
Average the results
Each data point used once for testing, 9 times for training

Stratified cross-validation
Ensure that each fold has the right proportion of each class value
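The fold bookkeeping can be sketched in a few lines of Python (illustrative only and unstratified; Weka additionally stratifies the folds by class value, and the function name here is made up):

```python
import random

def cross_val_folds(n_instances, k=10, seed=1):
    """Split instance indices into k disjoint folds.
    Each index lands in exactly one fold, so every instance is
    used once for testing and k-1 times for training."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # round-robin over shuffled indices

folds = cross_val_folds(100)
for test_fold in folds:
    # Train on everything outside the held-out fold
    train = [i for f in folds if f is not test_fold for i in f]
    assert len(test_fold) == 10 and len(train) == 90
```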
Lesson 2.5: Cross-validation

After cross-validation, Weka outputs an extra model built on the entire dataset

[Diagram: 10 times, the ML algorithm builds a classifier on 90% of the data and tests it on the other 10%, giving the evaluation results; an 11th time, it builds a classifier on 100% of the data, and that one is deployed]
Lesson 2.5: Cross-validation

Cross-validation is better than repeated holdout
Stratified is even better
With 10-fold cross-validation, Weka invokes the learning algorithm 11 times
Practical rule of thumb:
  Lots of data? Use percentage split
  Else, stratified 10-fold cross-validation

Course text
Section 5.3: Cross-validation
Data Mining with Weka

Class 2 Lesson 6
Cross-validation results

Lesson 2.6: Cross-validation results

Is cross-validation really better than repeated holdout?

Diabetes dataset
Baseline accuracy (rules > ZeroR): 65.1%
trees > J48, 10-fold cross-validation: 73.8%
... with different random-number seeds:

seed      1     2     3     4     5     6     7     8     9     10
accuracy  73.8  75.0  75.5  75.5  74.4  75.6  73.6  74.0  74.5  73.0
Lesson 2.6: Cross-validation results

Sample mean: x̄ = (Σ xi) / n
Variance: σ² = Σ (xi − x̄)² / (n − 1)
Standard deviation: σ

holdout (10%)   cross-validation (10-fold)
     75.3            73.8
     77.9            75.0
     80.5            75.5
     74.0            75.5
     71.4            74.4
     70.1            75.6
     79.2            73.6
     71.4            74.0
     80.5            74.5
     67.5            73.0

x̄ = 74.8        x̄ = 74.5
σ =  4.6        σ =  0.9
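The two columns can be verified with a short Python sketch; the standard deviations are what make the case, since both methods agree on the mean but cross-validation scatters far less around it:

```python
import statistics

# Ten J48 accuracy estimates on the diabetes data, seeds 1-10
holdout = [75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5]
cross_val = [73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0]

for name, xs in [("holdout", holdout), ("cross-validation", cross_val)]:
    print(f"{name}: mean = {statistics.mean(xs):.1f}, "
          f"sd = {statistics.stdev(xs):.1f}")
# holdout: mean = 74.8, sd = 4.6
# cross-validation: mean = 74.5, sd = 0.9
```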
Lesson 2.6: Cross-validation results

Why 10-fold? E.g. 20-fold: 75.1%

Cross-validation really is better than repeated holdout
It reduces the variance of the estimate
Data Mining with Weka

Department of Computer Science
University of Waikato
New Zealand

Creative Commons Attribution 3.0 Unported License
creativecommons.org/licenses/by/3.0/

weka.waikato.ac.nz
