
Data Mining with Weka

Class 2 Lesson 1
Be a classifier!

Ian H. Witten
Department of Computer Science
University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 2.1: Be a classifier!

Class 1: Getting started with Weka
Class 2: Evaluation
Class 3: Simple classifiers
Class 4: More classifiers
Class 5: Putting it all together

Lesson 2.1: Be a classifier!
Lesson 2.2: Training and testing
Lesson 2.3: More training/testing
Lesson 2.4: Baseline accuracy
Lesson 2.5: Cross-validation
Lesson 2.6: Cross-validation results
Lesson 2.1: Be a classifier!

Interactive decision tree construction

Load segment-challenge.arff; look at dataset
Select UserClassifier (tree classifier)
Use the test set segment-test.arff
Examine data visualizer and tree visualizer
Plot region-centroid-row vs intensity-mean
Rectangle, Polygon and Polyline selection tools
... several selections ...
Right-click in Tree visualizer and Accept the tree
Over to you: how well can you do?
Lesson 2.1: Be a classifier!

Build a tree: what strategy did you use?

Given enough time, you could produce a "perfect" tree for the dataset
... but would it perform well on the test data?

Course text
Section 11.2: Do it yourself: the UserClassifier
Data Mining with Weka

Class 2 Lesson 2
Training and testing

Lesson 2.2: Training and testing

[Diagram: training data feeds an ML algorithm, which produces a classifier; the classifier is evaluated on test data to give evaluation results, then deployed]
Basic assumption: training and test sets are produced by independent sampling from an infinite population.
Lesson 2.2: Training and testing

Use J48 to analyze the segment dataset

Open file segment-challenge.arff
Choose J48 decision tree learner (trees > J48)
Supplied test set: segment-test.arff
Run it: 96% accuracy
Evaluate on training set: 99% accuracy
Evaluate on percentage split: 95% accuracy
Do it again: get exactly the same result!
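The "exactly the same result" behaviour comes from the percentage split shuffling the data with a fixed random-number seed. A minimal Python sketch of the idea (not Weka's actual code; the function name and the 66% default are illustrative):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle with a fixed seed, then split into train/test.
    A fixed seed makes the 'random' split reproducible, which is
    why repeated runs report exactly the same accuracy."""
    rng = random.Random(seed)      # deterministic generator
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train1, test1 = percentage_split(data)
train2, test2 = percentage_split(data)
print(train1 == train2 and test1 == test2)  # True: same seed, same split
```

Changing the seed changes the split, which is exactly what the next lesson exploits to measure variation.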
Lesson 2.2: Training and testing

Basic assumption:
training and test sets sampled independently from an infinite population
Just one dataset? Hold some out for testing
Expect slight variation in results
... but Weka produces the same results each time
e.g. J48 on segment-challenge dataset

Course text
Section 5.1: Training and testing
Data Mining with Weka

Class 2 Lesson 3
Repeated training and testing

Lesson 2.3: Repeated training and testing

Evaluate J48 on segment-challenge

With segment-challenge.arff and J48 (trees > J48)
Set percentage split to 90%
Run it: 96.7% accuracy
Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 [More options]

Accuracies: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Lesson 2.3: Repeated training and testing

Evaluate J48 on segment-challenge

Sample mean: x̄ = (Σ xi) / n
Variance: σ² = Σ (xi − x̄)² / (n − 1)
Standard deviation: σ

Accuracies: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

x̄ = 0.949, σ = 0.018
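These statistics are easy to check outside Weka. A short Python sketch over the ten accuracies above (using the sample formulas with divisor n − 1; the slide's σ = 0.018 differs slightly because the printed accuracies are rounded to three digits):

```python
import statistics

# Accuracies from ten 90% percentage-split runs with seeds 1-10
accs = [0.967, 0.940, 0.940, 0.967, 0.953,
        0.967, 0.920, 0.947, 0.933, 0.947]

mean = statistics.mean(accs)   # sample mean  x̄ = (Σ xi) / n
sd = statistics.stdev(accs)    # sample standard deviation, divisor n − 1

print(f"mean = {mean:.3f}, sd = {sd:.3f}")  # mean = 0.948, sd = 0.016
```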
Lesson 2.3: Repeated training and testing

Basic assumption:
training and test sets sampled independently from an infinite population
Expect slight variation in results
... get it by setting the random-number seed
Can calculate mean and standard deviation experimentally
Data Mining with Weka

Class 2 Lesson 4
Baseline accuracy

Lesson 2.4: Baseline accuracy

Use diabetes dataset and default holdout

Open file diabetes.arff
Test option: Percentage split
Try these classifiers:
  trees > J48: 76%
  bayes > NaiveBayes: 77%
  lazy > IBk: 73%
  rules > PART: 74%
(we'll learn about them later)
768 instances (500 negative, 268 positive)
Always guess "negative": 500/768, about 65%
rules > ZeroR: predicts the most likely class!
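ZeroR involves no real learning at all: it just counts class values and always predicts the majority. A hypothetical Python sketch of that baseline on the diabetes class distribution (the function name is illustrative, not Weka's API):

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: always predict the most frequent class.
    Returns that class and the accuracy of always guessing it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# diabetes.arff: 500 negative, 268 positive instances
labels = ["negative"] * 500 + ["positive"] * 268
pred, acc = zero_r(labels)
print(pred, f"{acc:.1%}")  # negative 65.1%
```

Any classifier that cannot beat this number has learned nothing useful about the attributes.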
Lesson 2.4: Baseline accuracy

Sometimes baseline is best!
Open supermarket.arff and blindly apply
  rules > ZeroR: 64%
  trees > J48: 63%
  bayes > NaiveBayes: 63%
  lazy > IBk: 38% (!!)
  rules > PART: 63%
Attributes are not informative
Don't just apply Weka to a dataset:
you need to understand what's going on!
Lesson 2.4: Baseline accuracy

Consider whether differences are likely to be significant
Always try a simple baseline, e.g. rules > ZeroR
Look at the dataset
Don't blindly apply Weka: try to understand what's going on!
Data Mining with Weka

Class 2 Lesson 5
Cross-validation

Lesson 2.5: Cross-validation

Can we improve upon repeated holdout? (i.e. reduce variance)
Cross-validation
Stratified cross-validation
Lesson 2.5: Cross-validation

Repeated holdout
(in Lesson 2.3: hold out 10% for testing, repeat 10 times)

[Diagram: the dataset split into 90% training / 10% testing, repeated 10 times]
Lesson 2.5: Cross-validation

10-fold cross-validation

Divide dataset into 10 parts (folds)
Hold out each part in turn
Average the results
Each data point used once for testing, 9 times for training

Stratified cross-validation
Ensure that each fold has the right proportion of each class value
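The fold bookkeeping can be sketched in a few lines of Python (illustrative only and unstratified; Weka additionally stratifies the folds by class value, and the function name here is made up):

```python
import random

def cross_val_folds(n_instances, k=10, seed=1):
    """Split instance indices into k disjoint folds.
    Each index lands in exactly one fold, so every instance is
    used once for testing and k-1 times for training."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # round-robin over shuffled indices

folds = cross_val_folds(100)
for test_fold in folds:
    # Train on everything outside the held-out fold
    train = [i for f in folds if f is not test_fold for i in f]
    assert len(test_fold) == 10 and len(train) == 90
```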
Lesson 2.5: Cross-validation

After cross-validation, Weka outputs an extra model built on the entire dataset

[Diagram: 10 times, the ML algorithm builds a classifier on 90% of the data and tests it on the other 10%, giving the evaluation results; an 11th time, it builds a classifier on 100% of the data, and that one is deployed]
Lesson 2.5: Cross-validation

Cross-validation is better than repeated holdout
Stratified is even better
With 10-fold cross-validation, Weka invokes the learning algorithm 11 times
Practical rule of thumb:
  Lots of data? Use percentage split
  Else, stratified 10-fold cross-validation

Course text
Section 5.3: Cross-validation
Data Mining with Weka

Class 2 Lesson 6
Cross-validation results

Lesson 2.6: Cross-validation results

Is cross-validation really better than repeated holdout?

Diabetes dataset
Baseline accuracy (rules > ZeroR): 65.1%
trees > J48, 10-fold cross-validation: 73.8%
... with different random-number seeds:

seed      1     2     3     4     5     6     7     8     9     10
accuracy  73.8  75.0  75.5  75.5  74.4  75.6  73.6  74.0  74.5  73.0
Lesson 2.6: Cross-validation results

Sample mean: x̄ = (Σ xi) / n
Variance: σ² = Σ (xi − x̄)² / (n − 1)
Standard deviation: σ

holdout (10%)   cross-validation (10-fold)
     75.3            73.8
     77.9            75.0
     80.5            75.5
     74.0            75.5
     71.4            74.4
     70.1            75.6
     79.2            73.6
     71.4            74.0
     80.5            74.5
     67.5            73.0

x̄ = 74.8        x̄ = 74.5
σ =  4.6        σ =  0.9
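The two columns can be verified with a short Python sketch; the standard deviations are what make the case, since both methods agree on the mean but cross-validation scatters far less around it:

```python
import statistics

# Ten J48 accuracy estimates on the diabetes data, seeds 1-10
holdout = [75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5]
cross_val = [73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0]

for name, xs in [("holdout", holdout), ("cross-validation", cross_val)]:
    print(f"{name}: mean = {statistics.mean(xs):.1f}, "
          f"sd = {statistics.stdev(xs):.1f}")
# holdout: mean = 74.8, sd = 4.6
# cross-validation: mean = 74.5, sd = 0.9
```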
Lesson 2.6: Cross-validation results

Why 10-fold? E.g. 20-fold: 75.1%

Cross-validation really is better than repeated holdout
It reduces the variance of the estimate
Data Mining with Weka

Department of Computer Science
University of Waikato
New Zealand

Creative Commons Attribution 3.0 Unported License
creativecommons.org/licenses/by/3.0/

weka.waikato.ac.nz
