http://mallet.cs.umass.edu
David Mimno
Information Extraction and Synthesis Laboratory, Department of CS
UMass, Amherst
Outline
About MALLET
Representing Data
Classification
Sequence Tagging
Topic Modeling
Who?
Andrew McCallum (most of the work)
Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia
Fernando Pereira, others at Penn
Who am I?
Chief maintainer of MALLET
Primary author of MALLET topic modeling package
Why?
Motivation: text classification and information extraction
Commercial machine learning (Just Research, WhizBang)
Analysis and indexing of academic publications: Cora, Rexa
What?
Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0
How?
Command line scripts:
  bin/mallet [command] [option] [value]
  Text User Interface ("tui") classes
Direct Java API (most of this talk)
  http://mallet.cs.umass.edu/api
History
Version 0.4: c. 2004
  Classes in edu.umass.cs.mallet.base.*
Version 2.0: c. 2008
  Classes in cc.mallet.*
  Major changes to finite state transducer package
  bin/mallet vs. specialized scripts
  Java 1.5 generics
Learning More
http://mallet.cs.umass.edu
  "Quick Start" guides, focused on command line processing
  Developers' guides, with Java examples
mallet-dev@cs.umass.edu mailing list
  Low volume, but can be bursty
Models for Text Data
Generative models (Multinomials)
  Naïve Bayes
  Hidden Markov Models (HMMs)
  Latent Dirichlet Topic Models
Discriminative Regression Models
  MaxEnt/Logistic regression
  Conditional Random Fields (CRFs)
Representations
Transform text documents to vectors x1, x2, ...
Retain meaning of vector indices
Ideally sparsely
    Document: Call me Ishmael.
    Vector xi: 1.0 0.0 0.0 6.0 0.0 3.0
Elements of vector are called feature values
Example: Feature at row 345 is number of times "dog" appears in document
Documents to Vectors
    Document:             Call me Ishmael.
    Tokens:               Call me Ishmael
    Tokens (lowercased):  call me ishmael
    Features (sequence):  473, 3591, 17
    Alphabet:             17 = ishmael, 473 = call, 3591 = me
    Feature vector:       17: 1.0, 473: 1.0, 3591: 1.0
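The steps above can be sketched in plain Java. This is only an illustration of the idea, not MALLET code: the class name `DocumentToVectorSketch`, the letters-only tokenizer regex, and the use of `HashMap`/`TreeMap` are all assumptions made for the example, and indices are assigned in order of first appearance rather than matching the 473/3591/17 values on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; these are not MALLET classes.
class DocumentToVectorSketch {

    // Step 1: split raw text into word tokens and lowercase them.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\p{L}+").matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Step 2: replace each token by its integer index, assigning the next
    // free index the first time a token is seen (token order is preserved).
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        int[] sequence = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            Integer index = alphabet.get(token);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(token, index);
            }
            sequence[i] = index;
        }
        return sequence;
    }

    // Step 3: drop the order and count duplicates (index -> count).
    static Map<Integer, Double> toFeatureVector(int[] sequence) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (int index : sequence) {
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new HashMap<>();
        List<String> tokens = tokenize("Call me Ishmael.");
        int[] sequence = toFeatureSequence(tokens, alphabet);
        System.out.println(tokens + " -> " + toFeatureVector(sequence));
    }
}
```

Keeping the vector as an index-to-count map stores only the features that actually occur, which is the "ideally sparsely" point from the Representations slide.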
Instances
Email message, web page, sentence, journal abstract...
  Name: What is it called?
  Data: What is the input?
  Target/Label: What is the output?
  Source: What did it originally look like?
Common field types:
  String
  TokenSequence (ArrayList<Token>)
  FeatureSequence (int[])
  FeatureVector (int -> double map)
cc.mallet.types
Alphabets
Example entries: 17 = ishmael, 473 = call, 3591 = me
  TObjectIntHashMap map
  ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
void stopGrowth(): do not add entries for new Objects (default is to allow growth)
void startGrowth()
cc.mallet.types, gnu.trove
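As a rough picture of how the two data structures work together, here is a minimal plain-Java sketch of a two-way alphabet. It is not MALLET's implementation: `AlphabetSketch` is a hypothetical class, it uses `HashMap` in place of the Trove `TObjectIntHashMap`, and returning -1 for an entry that is absent and not added is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet: a two-way object <-> index mapping.
class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>(); // object -> index
    private final List<Object> entries = new ArrayList<>();   // index -> object
    private boolean growthStopped = false;

    // Returns the index of o. Unknown objects get the next free index when
    // shouldAdd is true and growth is allowed; otherwise -1 (sketch convention).
    int lookupIndex(Object o, boolean shouldAdd) {
        Integer index = map.get(o);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        int newIndex = entries.size();
        map.put(o, newIndex);
        entries.add(o);
        return newIndex;
    }

    Object lookupObject(int index) {
        return entries.get(index);
    }

    void stopGrowth() {
        growthStopped = true;
    }

    void startGrowth() {
        growthStopped = false;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        AlphabetSketch alphabet = new AlphabetSketch();
        int call = alphabet.lookupIndex("call", true);
        alphabet.stopGrowth(); // e.g. before processing test data
        System.out.println(call + " " + alphabet.lookupIndex("whale", true));
    }
}
```

Freezing the alphabet before reading test data keeps unseen words from creating feature indices that a trained model has no parameters for.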
Creating Instances
Instance constructor method
    new Instance(data, target, name, source)
Iterators (cc.mallet.pipe.iterator)
FileIterator
  Each instance in its own file
  Label from dir name
    /data/bad/
    /data/good/
CsvIterator
  Each instance on its own line
    1001  Melville  Call me Ishmael. Some years ago...
    1002  Dickens   It was the best of times, it was...
  Name, label, data from regular expression groups:
    ^([^\t]+)\t([^\t]+)\t(.*)
  "CSV" is a lousy name. LineRegexIterator?
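To see what the three capture groups in that regular expression pick out, the following sketch applies it with `java.util.regex` directly. It does not use `CsvIterator`; the class `LineRegexSketch` and its `parse` helper are hypothetical names for this example, and the input is assumed to be tab-separated as in the pattern.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the name/label/data regular expression using only java.util.regex.
class LineRegexSketch {
    // Group 1: name, group 2: label, group 3: data (tab-separated).
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String line = "1001\tMelville\tCall me Ishmael. Some years ago";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```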
Instance Pipelines
Sequential transformations of instance fields (usually Data)
Pass an ArrayList<Pipe> to SerialPipes
    // data is a String
    CharSequence2TokenSequence      // tokenize with regexp
    TokenSequenceLowercase          // modify each token's text
    TokenSequenceRemoveStopwords    // drop some tokens
    TokenSequence2FeatureSequence   // convert token Strings to ints
    FeatureSequence2FeatureVector   // lose order, count duplicates
cc.mallet.pipe
A small number of pipes modify the "target" field
There are now two alphabets: data and label
    // target is a String
    Target2Label                    // convert String to int
    // target is now a Label
cc.mallet.pipe, cc.mallet.types
Label objects
Weights on a fixed set of classes
For training data, weight for correct label is 1.0, all others 0.0
implements Labeling
    int getBestIndex()
    Label getBestLabel()
cc.mallet.types
Instance Lists
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet
    InstanceList instances = new InstanceList(pipe);
    instances.addThruPipe(iterator);
cc.mallet.types
Putting it all together
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new Target2Label());
    pipeList.add(new CharSequence2TokenSequence());
    pipeList.add(new TokenSequence2FeatureSequence());
    pipeList.add(new FeatureSequence2FeatureVector());
Persistent Storage
Most MALLET classes use Java serialization to store models and data
    // filename is a placeholder for the output path
    ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(filename));
    oos.writeObject(instances);
    oos.close();
Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException
java.io
Review
What are the four main fields in an Instance?
What are two ways to generate Instances?
How do we modify the value of Instance fields?
Name some classes that appear in the "data" field.
Classifier objects
Classifiers map from instances to distributions over a fixed set of classes
MaxEnt, Naïve Bayes, Decision Trees...
    Labeling labeling = classifier.classify(instance).getLabeling();
    Label l = labeling.getBestLabel();
    System.out.print(instance + "\t");
    System.out.println(l);
cc.mallet.classify
Training Classifier objects
Each type of classifier has one or more ClassifierTrainer classes
    ClassifierTrainer trainer = new MaxEntTrainer();
    Classifier classifier = trainer.train(instances);
cc.mallet.classify
Training Classifier objects
Some classifiers require numerical optimization of an objective function.
    log P(Labels | Data) = log f(label1, data1, w)
                         + log f(label2, data2, w)
                         + log f(label3, data3, w)
                         + ...
Maximize w.r.t. w!
cc.mallet.optimize
Parameters w
Association between feature, class label
How many parameters for K classes and N features?
    Feature     Class   Weight
    action      NN      0.13
    action      VB      0.1
    action      JJ      0.21
    SUFF-tion   NN      1.3
    SUFF-tion   VB      2.1
    SUFF-tion   JJ      1.7
    SUFF-on     NN      0.01
    SUFF-on     VB      0.02
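One way to make the objective and the parameter table concrete is the sketch below. It assumes that f is the usual MaxEnt (softmax) form, with one weight per (class, feature) pair stored as `w[k][n]` and no separate bias term; the slides do not spell out f, so treat this as an assumption. `MaxEntObjectiveSketch` is a hypothetical class, not part of MALLET.

```java
// Sketch of the label log-likelihood for a MaxEnt (softmax) classifier.
// Assumption: one weight per (class, feature) pair, no bias term.
class MaxEntObjectiveSketch {

    // Number of weights under the assumption above: K classes times N features.
    static int numParameters(int numClasses, int numFeatures) {
        return numClasses * numFeatures;
    }

    // log P(label | x; w), computed with the max-subtraction trick for stability.
    static double logProb(double[][] w, double[] x, int label) {
        double[] scores = new double[w.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < w.length; k++) {
            double score = 0.0;
            for (int n = 0; n < x.length; n++) {
                score += w[k][n] * x[n];
            }
            scores[k] = score;
            max = Math.max(max, score);
        }
        double sumExp = 0.0;
        for (double score : scores) {
            sumExp += Math.exp(score - max);
        }
        return scores[label] - max - Math.log(sumExp);
    }

    // Objective to maximize w.r.t. w: the sum of per-instance log probabilities.
    static double logLikelihood(double[][] w, double[][] data, int[] labels) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += logProb(w, data[i], labels[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] w = new double[3][2]; // 3 classes, 2 features, all weights zero
        double[][] data = { {1.0, 0.0}, {0.0, 2.0} };
        int[] labels = { 0, 2 };
        System.out.println(logLikelihood(w, data, labels));
    }
}
```

An optimizer such as L-BFGS would repeatedly evaluate this value (and its gradient) while adjusting `w`.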
Training Classifier objects
interface Optimizer
    boolean optimize()
  Limited-memory BFGS, Conjugate gradient...
Optimizable interface, for example MaxEntOptimizableByLabelLikelihood
    double[] getParameters()
    void setParameters(double[] parameters)
    double getValue()
    void getValueGradient(double[] buffer)
Evaluation of Classifiers
Create random test/train splits
    InstanceList[] instanceLists =
        instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});
cc.mallet.types
The Trial class stores the results of classifications on an InstanceList (testing or training)
    Trial(Classifier c, InstanceList list)
    double getAccuracy()
    double getAverageRank()
    double getF1(int/Label/Object)
    double getPrecision()
    double getRecall()
cc.mallet.classify
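For reference, the quantities behind those accessor names can be computed as in the sketch below. This is a plain-Java illustration with integer class indices, not the Trial class itself; `TrialSketch` is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen for this example only.

```java
// Sketch of the standard evaluation measures over gold and predicted class indices.
class TrialSketch {

    // Fraction of instances whose predicted class equals the gold class.
    static double accuracy(int[] gold, int[] predicted) {
        if (gold.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    // Of the instances predicted as label, the fraction that really are label.
    static double precision(int[] gold, int[] predicted, int label) {
        int truePositives = 0;
        int predictedCount = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                predictedCount++;
                if (gold[i] == label) {
                    truePositives++;
                }
            }
        }
        return predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
    }

    // Of the instances that really are label, the fraction predicted as label.
    static double recall(int[] gold, int[] predicted, int label) {
        int truePositives = 0;
        int goldCount = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                goldCount++;
                if (predicted[i] == label) {
                    truePositives++;
                }
            }
        }
        return goldCount == 0 ? 0.0 : (double) truePositives / goldCount;
    }

    // Harmonic mean of precision and recall.
    static double f1(int[] gold, int[] predicted, int label) {
        double p = precision(gold, predicted, label);
        double r = recall(gold, predicted, label);
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int[] gold = {0, 0, 1, 1};
        int[] predicted = {0, 1, 1, 1};
        System.out.println(accuracy(gold, predicted) + " " + f1(gold, predicted, 1));
    }
}
```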
Review
I have invented a new classifier: David regression.
What class should I implement to classify instances?
What class should I implement to train a David regression classifier?
I want to train using Optimizable.ByGradientValue. What mathematical functions do I need to code up, and what class should I put them in?
How would I check whether my new classifier works better than Naïve Bayes?
Sequence Tagging
Data occurs in sequences
Categorical labels for each position
Labels are correlated
    DET  NN   VBS    VBG
    the  dog  likes  running
Classification: n-way
Sequence Tagging: n^T-way
Candidate labels at every position: NN, JJ, PRP, VB, CC
    or red dogs on blue trees
Avoiding Exponential Blowup
Markov property
  Example labels: DET JJ NN VB
  Given the previous label, the current label is independent of all earlier labels
Dynamic programming
  [Figure: lattice of candidate labels (NN, JJ, PRP, VB, CC) over "or red dogs on blue trees", resolved one position at a time]
[Photo: Andrei Markov]
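The two ideas combine in the Viterbi algorithm, sketched below in plain Java as one standard instance of dynamic programming under the Markov property. The slides do not name the algorithm, so this is an illustration rather than a description of MALLET's internals. The score arrays (`start`, `trans`, `emit`) are assumed to hold log-scale scores, and `ViterbiSketch` is a hypothetical class.

```java
import java.util.Arrays;

// Viterbi decoding: finds the best label sequence in O(T * K^2) time
// instead of enumerating all K^T sequences.
class ViterbiSketch {

    // start[j]:    log score of beginning in label j
    // trans[i][j]: log score of moving from label i to label j
    // emit[t][j]:  log score of label j at position t
    static int[] decode(double[] start, double[][] trans, double[][] emit) {
        int length = emit.length;
        int numLabels = start.length;
        if (length == 0) {
            return new int[0];
        }
        double[][] best = new double[length][numLabels]; // best score ending in label j at t
        int[][] back = new int[length][numLabels];       // previous label on that best path

        for (int j = 0; j < numLabels; j++) {
            best[0][j] = start[j] + emit[0][j];
        }
        for (int t = 1; t < length; t++) {
            for (int j = 0; j < numLabels; j++) {
                int argmax = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < numLabels; i++) {
                    double score = best[t - 1][i] + trans[i][j];
                    if (score > max) {
                        max = score;
                        argmax = i;
                    }
                }
                best[t][j] = max + emit[t][j];
                back[t][j] = argmax;
            }
        }

        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int j = 1; j < numLabels; j++) {
            if (best[length - 1][j] > best[length - 1][last]) {
                last = j;
            }
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[] start = {0.0, -1.0};
        double[][] trans = { {0.0, -2.0}, {-2.0, 0.0} };
        double[][] emit = { {0.0, -1.0}, {-1.0, 0.0}, {0.0, -1.0} };
        System.out.println(Arrays.toString(decode(start, trans, emit)));
    }
}
```

Because each position only needs the best scores from the position before it, the work grows linearly with sequence length.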
Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: fully generative
    P(Labels | Data) = P(Data, Labels) / P(Data)
Conditional Random Field: conditional
    P(Labels | Data)
Hidden Markov Model: simple (independent) output space
    FeatureSequence (int[])
Conditional Random Field: arbitrarily complicated outputs
    FeatureVectorSequence (FeatureVector[])
Example token: NSF-funded
Importing Data
SimpleTagger format: one word per line, with instances delimited by a blank line
    Call VB
    me PPN
    Ishmael NNP
    . .

    Some JJ
    years NNS
With additional features before the label:
    Call SUFF-ll VB
    me TWO_LETTERS PPN
    Ishmael BIBLICAL_NAME NNP
    . PUNCTUATION .

    Some CAPITALIZED JJ
    years TIME SUFF-s NNS
Pipeline:
    LineGroupIterator
    SimpleTaggerSentence2TokenSequence()    // String to Tokens, handles labels
    [Pipes that modify tokens]
    TokenSequence2FeatureVectorSequence()   // Token objects to FeatureVectors
cc.mallet.pipe, cc.mallet.pipe.iterator
Pipes that modify tokens:
                                                           // Ishmael
    TokenTextCharSuffix("C2=", 2)                          // Ishmael C2=el
    RegexMatches("CAP", Pattern.compile("\\p{Lu}.*"))      // Ishmael C2=el CAP
    LexiconMembership("NAME", new File("names"), false)    // Ishmael C2=el CAP NAME
Sliding window features
    a red dog on a blue tree
Features for "dog": red@-1, a@-2, on@1, a@-2_&_red@-1
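The naming scheme above (word, "@", offset, with "_&_" joining conjoined offsets) can be reproduced with a few lines of plain Java. This sketch only mimics the feature names shown on the slide; `OffsetFeatureSketch` is a hypothetical class, and skipping any conjunction that reaches past the sentence boundary is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Builds sliding-window feature names such as "red@-1" and "a@-2_&_red@-1".
class OffsetFeatureSketch {

    // Each inner array of conjunctions lists the offsets that are joined
    // into one feature for the token at the given position.
    static List<String> features(String[] tokens, int position, int[][] conjunctions) {
        List<String> result = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int p = position + offset;
                if (p < 0 || p >= tokens.length) {
                    inRange = false; // window falls off the sentence: skip this feature
                    break;
                }
                if (name.length() > 0) {
                    name.append("_&_");
                }
                name.append(tokens[p]).append('@').append(offset);
            }
            if (inRange && name.length() > 0) {
                result.add(name.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "a red dog on a blue tree".split(" ");
        int[][] conjunctions = { {-1}, {-2}, {1}, {-2, -1} };
        System.out.println(features(tokens, 2, conjunctions));
    }
}
```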
Importing Data
    int[][] conjunctions = new int[3][];
    conjunctions[0] = new int[] { -1 };       // previous
    conjunctions[1] = new int[] { 1 };        // next
    conjunctions[2] = new int[] { -2, -1 };   // previous two
cc.mallet.pipe.tsf
Finite State Transducers
Finite state machine over two alphabets (observed, hidden)
    Hidden:    DET  NN   VBS
    Observed:  the  dog
Probabilities, step by step:
    P(DET)
    P(the | DET)
    P(NN | DET)
    P(dog | NN)
    P(VBS | NN)
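Multiplying those factors together gives the joint probability of a label sequence and its words under an HMM-style model. The sketch below does this in log space with plain Java maps; `HmmJointSketch` and the string-keyed tables are hypothetical, any probability values used with it are made-up illustrations, and a missing table entry is treated as probability 0.

```java
import java.util.HashMap;
import java.util.Map;

// Joint log-probability of (labels, words) as a product of initial,
// transition, and emission probabilities.
class HmmJointSketch {

    // transition keys look like "DET NN" (previous label, next label);
    // emission keys look like "DET the" (label, word).
    static double logJoint(String[] labels, String[] words,
                           Map<String, Double> initial,
                           Map<String, Double> transition,
                           Map<String, Double> emission) {
        double logP = 0.0;
        for (int t = 0; t < labels.length; t++) {
            double step = (t == 0)
                ? initial.getOrDefault(labels[0], 0.0)
                : transition.getOrDefault(labels[t - 1] + " " + labels[t], 0.0);
            double emit = emission.getOrDefault(labels[t] + " " + words[t], 0.0);
            // Math.log(0.0) is -Infinity, so a disallowed step rules out the whole path.
            logP += Math.log(step) + Math.log(emit);
        }
        return logP;
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        Map<String, Double> initial = new HashMap<>();
        initial.put("DET", 0.5);
        Map<String, Double> transition = new HashMap<>();
        transition.put("DET NN", 0.8);
        Map<String, Double> emission = new HashMap<>();
        emission.put("DET the", 0.4);
        emission.put("NN dog", 0.1);
        System.out.println(logJoint(new String[] {"DET", "NN"},
                                    new String[] {"the", "dog"},
                                    initial, transition, emission));
    }
}
```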
How many parameters?
Determines efficiency of training
Too many leads to overfitting
Trick: Don't allow certain transitions
    P(VBS | DET) = 0
Example:
    DET  NN   VBS
    the  dog  runs
Finite State Transducers
abstract class Transducer
    CRF
    HMM
abstract class TransducerTrainer
    CRFTrainerByLabelLikelihood
    HMMTrainerByLikelihood
cc.mallet.fst
Finite State Transducers
Example:
    DET  NN   VBS
    the  dog  runs
First order: one weight for every pair of labels and observations.
"Three-quarter" order: one weight for every pair of labels and observations.
    crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);
Second order: one weight for every triplet of labels and observations.
    crf.addStatesForBiLabelsConnectedAsIn(instances);
"Half" order: equivalent to independent classifiers, except some transitions may be illegal.
    crf.addStatesForHalfLabelsConnectedAsIn(instances);
cc.mallet.fst
Training a transducer
    CRF crf = new CRF(pipe, null);
    crf.addStatesForLabelsConnectedAsIn(trainingInstances);
    CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
    trainer.train(trainingInstances);
cc.mallet.fst
Evaluating a transducer
    CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);
    TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");
    trainer.addEvaluator(evaluator);
    trainer.train(trainingInstances);
cc.mallet.fst
Applying a transducer
    Sequence output = transducer.transduce(input);
    for (int index = 0; index < input.size(); index++) {
        System.out.print(input.get(index) + "/");
        System.out.print(output.get(index) + " ");
    }
cc.mallet.fst
Review
How do you add new features to TokenSequences?
What are three factors that affect the number of parameters in a model?
Topics: Semantic Groups
News Article
Topic: Negotiation
Words: strike, team, player, deadline, union, game
Training a Topic Model
    ParallelTopicModel lda = new ParallelTopicModel(numTopics);
    lda.addInstances(trainingInstances);
    lda.estimate();
cc.mallet.topics
Evaluating a Topic Model
    // lda is trained as in "Training a Topic Model"
    MarginalProbEstimator evaluator = lda.getProbEstimator();
    double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);
cc.mallet.topics
Inferring topics for new documents
    // lda is trained as in "Training a Topic Model"
    TopicInferencer inferencer = lda.getInferencer();
    double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);
cc.mallet.topics
More than words
Text collections mix free text and structured data
David Mimno, Andrew McCallum. Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. UAI 2008.
Dirichlet-multinomial Regression (DMR)
Topic parameters for feature "published in JMLR"
    2.27  kernel, kernels, rational kernels, string kernels, fisher kernel
    1.74  bounds, vc dimension, bound, upper bound, lower bounds
    1.41  reinforcement learning, learning, reinforcement
    1.40  blind source separation, source separation, separation, channel
    1.37  nearest neighbor, boosting, nearest neighbors, adaboost
    1.12  agent, agents, multi-agent, autonomous agents
    1.21  strategies, strategy, adaptation, adaptive, driven
    1.23  retrieval, information retrieval, query, query expansion
    1.36  web, web pages, web page, world wide web, web sites
    1.44  user, users, user interface, interactive, interface
Feature parameters for RL topic
    2.99  Sridhar Mahadevan
    2.88  ICML
    2.56  Kenji Doya
    2.45  ECML
    2.19  Machine Learning Journal
    1.38  ACL
    1.47  CVPR
    1.54  IEEE Trans. PAMI
    1.64  COLING
    3.76  <default>
Topic parameters for feature "published in UAI"
    2.88  bayesian networks, bayesian network, belief networks
    2.26  qualitative, reasoning, qualitative reasoning, qualitative simulation
    2.25  probability, probabilities, probability distributions, ...
    2.25  uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist
    2.11  reasoning, logic, default reasoning, nonmonotonic reasoning
    1.29  shape, deformable, shapes, contour, active contour
    1.36  digital libraries, digital library, digital, library
    1.37  workshop report, invited talk, international conference, report
    1.50  descriptions, description, top, bottom, top bottom
    1.50  nearest neighbor, boosting, nearest neighbors, adaboost
Dirichlet-multinomial Regression
Arbitrary observed features of documents
Target contains FeatureVector
Polylingual Topic Modeling
Topics exist in more languages than you could possibly learn
Topically comparable documents are much easier to get than translation sets
Translation dictionaries
  cover pairs, not sets of languages
  miss technical vocabulary
  aren't available for low-resource languages
Topicsfrom Wikipedia
Aligned instance lists
    dog  cat  pig
    chien  chat
    hund  schwein
Polylingual Topics
    InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };
    PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);
    pltm.addInstances(training);
MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallethandson.tar.gz