Data Mining with Weka
Class 2 Lesson 1
Be a classifier!
Ian H. Witten
Department of Computer Science
University of Waikato
New Zealand
weka.waikato.ac.nz
Lesson 2.1: Be a classifier!

Course outline:
Class 1: Getting started with Weka
Class 2: Evaluation
  Lesson 2.1: Be a classifier!
  Lesson 2.2: Training and testing
  Lesson 2.3: More training/testing
  Lesson 2.4: Baseline accuracy
  Lesson 2.5: Cross-validation
  Lesson 2.6: Cross-validation results
Class 3: Simple classifiers
Class 4: More classifiers
Class 5: Putting it all together
Lesson 2.1: Be a classifier!
Interactive decision tree construction:
- Load segment-challenge.arff; look at the dataset
- Select UserClassifier (a tree classifier)
- Use the test set segment-test.arff
- Examine the data visualizer and tree visualizer
- Plot region-centroid-row vs intensity-mean
- Use the Rectangle, Polygon and Polyline selection tools to make several selections
- Right-click in the Tree visualizer and Accept the tree
Over to you: how well can you do?
Lesson 2.1: Be a classifier!
- Build a tree: what strategy did you use?
- Given enough time, you could produce a "perfect" tree for the dataset
- but would it perform well on the test data?
Course text: Section 11.2 Do it yourself: the UserClassifier
Data Mining with Weka
Class 2 Lesson 2
Training and testing
Lesson 2.2: Training and testing
Lesson 2.2: Training and testing
[Diagram: a machine-learning algorithm builds a classifier from training data; the classifier is then run on separate test data to produce evaluation results]
Basic assumption: training and test sets are produced by independent sampling from an infinite population
Lesson 2.2: Training and testing
Use J48 to analyze the segment dataset:
- Open file segment-challenge.arff
- Choose the J48 decision tree learner (trees > J48)
- Supplied test set: segment-test.arff
- Run it: 96% accuracy
- Evaluate on the training set: 99% accuracy
- Evaluate on a percentage split: 95% accuracy
- Do it again: you get exactly the same result!
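Weka's percentage split shuffles the data with a fixed default random seed before splitting, which is why rerunning the evaluation gives the identical result. A minimal sketch of that idea in Python (the integer list is just a stand-in for dataset instances, not the segment data):

```python
import random

def percentage_split(data, train_pct=66, seed=1):
    """Shuffle with a fixed seed, then split into train/test portions."""
    shuffled = data[:]                        # copy; leave the original order intact
    random.Random(seed).shuffle(shuffled)     # same seed -> same shuffle every run
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))                       # stand-in for dataset instances
train1, test1 = percentage_split(data)
train2, test2 = percentage_split(data)
print(train1 == train2 and test1 == test2)    # prints True: identical split both times
```

Changing the seed changes the shuffle, which is exactly what the next lesson exploits to measure variation.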
Lesson 2.2: Training and testing
Basic assumption: training and test sets are sampled independently from an infinite population
- Just one dataset? Hold some out for testing
- Expect slight variation in results, but Weka produces the same results each time (e.g. J48 on the segment-challenge dataset)
Course text: Section 5.1 Training and testing
Data Mining with Weka
Class 2 Lesson 3
Repeated training and testing
Lesson 2.3: Repeated training and testing
Lesson 2.3: Repeated training and testing
Evaluate J48 on segment-challenge:
- With segment-challenge.arff and J48 (trees > J48)
- Set the percentage split to 90%
- Run it: 96.7% accuracy
- Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 (under [More options...])

Accuracies for seeds 1-10: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Lesson 2.3: Repeated training and testing
Evaluate J48 on segment-challenge

Accuracies for seeds 1-10: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

Sample mean: $\bar{x} = \frac{\sum_i x_i}{n}$
Variance: $\sigma^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$
Standard deviation: $\sigma$

$\bar{x} = 0.949$, $\sigma = 0.018$
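The summary statistics on this slide can be reproduced in a few lines of Python; `statistics.stdev` uses the same n−1 divisor as the variance formula above:

```python
from statistics import mean, stdev

# the ten percentage-split accuracies obtained with seeds 1-10
acc = [0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947]

x_bar = mean(acc)   # sample mean
s = stdev(acc)      # sample standard deviation (n-1 divisor)
print(f"mean = {x_bar:.3f}, std dev = {s:.3f}")
```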
Lesson 2.3: Repeated training and testing
Basic assumption: training and test sets are sampled independently from an infinite population
- Expect slight variation in results; get it by setting the random-number seed
- Can calculate the mean and standard deviation experimentally
Data Mining with Weka
Class 2 Lesson 4
Baseline accuracy
Lesson 2.4: Baseline accuracy
Lesson 2.4: Baseline accuracy
Use the diabetes dataset and the default holdout:
- Open file diabetes.arff
- Test option: Percentage split
- Try these classifiers (we'll learn about them later):
  trees > J48: 76%
  bayes > NaiveBayes: 77%
  lazy > IBk: 73%
  rules > PART: 74%
- 768 instances (500 negative, 268 positive)
- Always guessing negative gets 500/768, about 65%
- rules > ZeroR: predicts the most likely class!
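ZeroR ignores every attribute and simply predicts the training set's most frequent class, so its accuracy is the majority-class proportion. A minimal sketch (the labels below just mirror the 500 negative / 268 positive split of diabetes.arff):

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: return the majority class -- predict it for every instance."""
    return Counter(train_labels).most_common(1)[0][0]

# stand-in labels mirroring diabetes.arff: 500 negative, 268 positive
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority = zero_r(labels)
accuracy = labels.count(majority) / len(labels)
print(majority, round(accuracy, 3))   # 500/768, the ~65% baseline
```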
Lesson 2.4: Baseline accuracy
Sometimes the baseline is best!
- Open supermarket.arff and blindly apply:
  rules > ZeroR: 64%
  trees > J48: 63%
  bayes > NaiveBayes: 63%
  lazy > IBk: 38% (!!)
  rules > PART: 63%
- The attributes are not informative
- Don't just apply Weka to a dataset: you need to understand what's going on!
Lesson 2.4: Baseline accuracy
- Consider whether differences are likely to be significant
- Always try a simple baseline, e.g. rules > ZeroR
- Look at the dataset
- Don't blindly apply Weka: try to understand what's going on!
Data Mining with Weka
Class 2 Lesson 5
Cross-validation
Lesson 2.5: Cross-validation
Lesson 2.5: Cross-validation
Can we improve upon repeated holdout (i.e. reduce the variance)?
- Cross-validation
- Stratified cross-validation
Lesson 2.5: Cross-validation
Repeated holdout (as in Lesson 2.3: hold out 10% for testing; repeat 10 times)
Lesson 2.5: Cross-validation
10-fold cross-validation:
- Divide the dataset into 10 parts ("folds")
- Hold out each part in turn
- Average the results
- Each data point is used once for testing and 9 times for training
Stratified cross-validation:
- Ensure that each fold has the right proportion of each class value
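The stratification step can be sketched directly. This simplified splitter (an illustration, not Weka's actual implementation) shuffles the instance indices within each class and deals them round-robin across the folds, so every fold keeps roughly the full dataset's class proportions:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Split instance indices into k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)              # group instance indices by class
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)                  # shuffle within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)         # deal round-robin across the folds
    return folds

# stand-in labels mirroring diabetes.arff: 500 negative, 268 positive
labels = ["neg"] * 500 + ["pos"] * 268
folds = stratified_folds(labels)
# every fold gets ~50 negatives and ~27 positives, like the full dataset
print([sum(labels[i] == "pos" for i in f) for f in folds])
```

Each fold then serves once as the test set while the other nine are pooled for training.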
Lesson 2.5: Cross-validation
After cross-validation, Weka outputs an extra model built on the entire dataset
[Diagram: 10 times over, 90% of the data goes to the learning algorithm to build a classifier and the remaining 10% yields evaluation results; an 11th time, 100% of the data builds the final classifier to deploy]
Lesson 2.5: Cross-validation
- Cross-validation is better than repeated holdout
- Stratified cross-validation is even better
- With 10-fold cross-validation, Weka invokes the learning algorithm 11 times
- Practical rule of thumb: lots of data? Use percentage split. Otherwise, use stratified 10-fold cross-validation
Course text: Section 5.3 Cross-validation
Data Mining with Weka
Class 2 Lesson 6
Cross-validation results
Lesson 2.6: Cross-validation results
Lesson 2.6: Cross-validation results
Is cross-validation really better than repeated holdout?
Diabetes dataset:
- Baseline accuracy (rules > ZeroR): 65.1%
- trees > J48 with 10-fold cross-validation: 73.8%
- with different random-number seeds:

Seed:         1    2    3    4    5    6    7    8    9    10
Accuracy (%): 73.8 75.0 75.5 75.5 74.4 75.6 73.6 74.0 74.5 73.0
Lesson 2.6: Cross-validation results

Accuracy (%) over 10 runs:

holdout (10%)   cross-validation (10-fold)
75.3            73.8
77.9            75.0
80.5            75.5
74.0            75.5
71.4            74.4
70.1            75.6
79.2            73.6
71.4            74.0
80.5            74.5
67.5            73.0

Sample mean: $\bar{x} = \frac{\sum_i x_i}{n}$; Variance: $\sigma^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$; Standard deviation: $\sigma$

holdout: $\bar{x} = 74.8$, $\sigma = 4.6$; cross-validation: $\bar{x} = 74.5$, $\sigma = 0.9$
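The two columns can be checked with the same n−1 sample formulas; the computation below confirms how much more tightly the cross-validation estimates cluster:

```python
from statistics import mean, stdev

# accuracies (%) from 10 holdout runs vs 10-fold cross-validation runs
holdout = [75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5]
crossval = [73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0]

for name, xs in [("holdout", holdout), ("cross-validation", crossval)]:
    print(f"{name}: mean = {mean(xs):.1f}, std dev = {stdev(xs):.1f}")
# holdout: mean = 74.8, std dev = 4.6
# cross-validation: mean = 74.5, std dev = 0.9
```

The means agree closely, but the holdout spread is about five times larger, which is the point of the lesson.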
Lesson 2.6: Cross-validation results
- Why 10-fold? E.g. 20-fold gives 75.1%
- Cross-validation really is better than repeated holdout
- It reduces the variance of the estimate
Data Mining with Weka
Department of Computer Science
University of Waikato
New Zealand
Creative Commons Attribution 3.0 Unported License
creativecommons.org/licenses/by/3.0/
weka.waikato.ac.nz