
Machine Learning with MALLET

http://mallet.cs.umass.edu
David Mimno, Information Extraction and Synthesis Laboratory, Department of CS, UMass, Amherst

Outline
About MALLET, Representing Data, Classification, Sequence Tagging, Topic Modeling

Who?
Andrew McCallum (most of the work)
Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia
Fernando Pereira, others at Penn

Who am I?
Chief maintainer of MALLET
Primary author of MALLET topic modeling package

Why?
Motivation: text classification and information extraction
Commercial machine learning (Just Research, WhizBang)
Analysis and indexing of academic publications: Cora, Rexa

What?
Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0

How?
Command line scripts:
  bin/mallet [command] --[option] [value]
  Text User Interface ("tui") classes

Direct Java API
  http://mallet.cs.umass.edu/api
  Most of this talk

History
Version 0.4: c. 2004
  Classes in edu.umass.cs.mallet.base.*

Version 2.0: c. 2008
  Classes in cc.mallet.*
  Major changes to finite state transducer package
  bin/mallet vs. specialized scripts
  Java 1.5 generics

Learning More
http://mallet.cs.umass.edu
  "Quick Start" guides, focused on command line processing
  Developers' guides, with Java examples

mallet-dev@cs.umass.edu mailing list
  Low volume, but can be bursty

Outline
About MALLET, Representing Data, Classification, Sequence Tagging, Topic Modeling

Models for Text Data
Generative models (Multinomials)
  Naïve Bayes
  Hidden Markov Models (HMMs)
  Latent Dirichlet Topic Models

Discriminative Regression Models
  MaxEnt/Logistic regression
  Conditional Random Fields (CRFs)

Representations
Transform text documents to vectors x1, x2, ...
Retain meaning of vector indices
Ideally sparsely

Document: Call me Ishmael.
xi: 1.0 0.0 0.0 6.0 0.0 3.0

Representations
Elements of vector are called feature values
Example: Feature at row 345 is number of times "dog" appears in document

xi: 1.0 0.0 0.0 6.0 0.0 3.0

Documents to Vectors
Document: Call me Ishmael.
Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael
Alphabet: 17 = ishmael, 473 = call, 3591 = me
Features (sequence): 473, 3591, 17
Features (bag): 17 -> 1.0, 473 -> 1.0, 3591 -> 1.0
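
The steps above can be sketched in plain Java without MALLET. This is only an illustration under stated assumptions: the class and method names are hypothetical, the tokenizer is a simple letters-only regex, and indices are handed out in order of first appearance (0, 1, 2, ...) rather than the 473, 3591, 17 shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: document -> tokens -> feature sequence -> bag of features.
class DocumentToVectorSketch {

    // Tokenize on runs of letters and lowercase each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Map each token to an integer index, growing the vocabulary as needed.
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> vocabulary) {
        int[] sequence = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            Integer index = vocabulary.get(token);
            if (index == null) {
                index = vocabulary.size();
                vocabulary.put(token, index);
            }
            sequence[i] = index;
        }
        return sequence;
    }

    // Lose order, count duplicates: feature index -> count.
    static Map<Integer, Double> toBag(int[] sequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int feature : sequence) {
            bag.merge(feature, 1.0, Double::sum);
        }
        return bag;
    }
}
```

The bag is stored as a sparse map, so only features that actually occur take up space.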

Instances
Email message, web page, sentence, journal abstract...
Name: What is it called?
Data: What is the input?
Target/Label: What is the output?
Source: What did it originally look like?

Instances
Fields: Name, Data, Target, Source
Classes that appear in the fields:
  String
  TokenSequence (ArrayList<Token>)
  FeatureSequence (int[])
  FeatureVector (int -> double map)

cc.mallet.types

Alphabets
17 = ishmael, 473 = call, 3591 = me
TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)

cc.mallet.types, gnu.trove

Alphabets
17 = ishmael, 473 = call, 3591 = me
TObjectIntHashMap map
ArrayList entries

void stopGrowth()    // do not add entries for new Objects -- default is to allow growth
void startGrowth()

cc.mallet.types, gnu.trove
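
A minimal sketch of the two-way lookup idea, using java.util collections in place of gnu.trove. The class name is hypothetical, and returning -1 for an object that is unknown and not added is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet: object -> index via a map, index -> object via a list.
class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();
    private final List<Object> entries = new ArrayList<>();
    private boolean growthStopped = false;

    // Returns the index of o; adds o when allowed, otherwise returns -1 (sketch convention).
    int lookupIndex(Object o, boolean shouldAdd) {
        Integer index = map.get(o);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        int newIndex = entries.size();
        map.put(o, newIndex);
        entries.add(o);
        return newIndex;
    }

    Object lookupObject(int index) {
        return entries.get(index);
    }

    void stopGrowth() { growthStopped = true; }

    void startGrowth() { growthStopped = false; }

    int size() { return entries.size(); }
}
```

Stopping growth before processing test data keeps unseen words from silently enlarging the feature space.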

Creating Instances
Instance constructor method
  new Instance(data, target, name, source)

Iterators
  Iterator<Instance>
  FileIterator(File[], )
  CsvIterator(FileReader, Pattern)
  ArrayIterator(Object[])

cc.mallet.pipe.iterator

Creating Instances
FileIterator
  /data/bad/
  /data/good/
  Label from dir name
  Each instance in its own file

cc.mallet.pipe.iterator

Creating Instances
CsvIterator
Each instance on its own line

  1001  Melville  Call me Ishmael. Some years ago
  1002  Dickens   It was the best of times, it was

  ^([^\t]+)\t([^\t]+)\t(.*)

Name, label, data from regular expression groups.
"CSV" is a lousy name. LineRegexIterator?

cc.mallet.pipe.iterator
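
To see what the three groups of that regular expression capture, here is a small sketch with java.util.regex only. The class and method names are hypothetical; the pattern is the one on the slide, written as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split one tab-separated line into name, label, data.
class LineRegexSketch {
    // Same pattern as on the slide; backslashes are doubled for the Java string literal.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

For example, parsing "1001\tMelville\tCall me Ishmael." yields the name 1001, the label Melville, and the rest of the line as data.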

Instance Pipelines
Sequential transformations of instance fields (usually Data)
Pass an ArrayList<Pipe> to SerialPipes

// data is a String
CharSequence2TokenSequence       // tokenize with regexp
TokenSequenceLowercase           // modify each token's text
TokenSequenceRemoveStopwords     // drop some tokens
TokenSequence2FeatureSequence    // convert token Strings to ints
FeatureSequence2FeatureVector    // lose order, count duplicates

cc.mallet.pipe

Instance Pipelines
A small number of pipes modify the target field
There are now two alphabets: data and label

// target is a String
Target2Label    // convert String to int
// target is now a Label

Alphabet > LabelAlphabet

cc.mallet.pipe, cc.mallet.types

Label objects
Weights on a fixed set of classes
For training data, weight for correct label is 1.0, all others 0.0

implements Labeling
int getBestIndex()
Label getBestLabel()

You cannot create a Label, they are only produced by LabelAlphabet

cc.mallet.types

InstanceLists
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet

InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);

cc.mallet.types

Putting it all together
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());

InstanceList instances = new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));

Persistent Storage
Most MALLET classes use Java serialization to store models and data

// file: the destination File (variable name assumed here)
ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file));
oos.writeObject(instances);
oos.close();

Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException

java.io
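
A round trip through Java serialization can be sketched with the standard library alone. The class and method names below are hypothetical, and an in-memory byte array stands in for a file; any Serializable object (here, whatever the caller passes) is written and read back the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch: write an object with ObjectOutputStream, read it back with ObjectInputStream.
class SerializationSketch {

    static byte[] save(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object load(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            // The class of the stored object is not on the classpath.
            throw new IllegalStateException(e);
        }
    }
}
```

Reading requires the same classes on the classpath as writing, which is why custom pipes and data classes must be available when a stored model is loaded.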

Review
What are the four main fields in an Instance?
What are two ways to generate Instances?
How do we modify the value of Instance fields?
Name some classes that appear in the data field.

Outline
About MALLET, Representing Data, Classification, Sequence Tagging, Topic Modeling

Classifier objects
Classifiers map from instances to distributions over a fixed set of classes
MaxEnt, Naïve Bayes, Decision Trees...

Given data: watery
Classes: NN JJ PRP VB CC
Which class is best? (this one!)

// classify() returns a Classification; its Labeling holds the distribution over classes
Labeling labeling = classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);

cc.mallet.classify

Training Classifier objects
Each type of classifier has one or more ClassifierTrainer classes

ClassifierTrainer trainer = new MaxEntTrainer();
Classifier classifier = trainer.train(instances);

cc.mallet.classify

Training Classifier objects
Some classifiers require numerical optimization of an objective function.

log P(Labels | Data) = log f(label1, data1, w) + log f(label2, data2, w) + log f(label3, data3, w) + ...

Maximize w.r.t. w!

cc.mallet.optimize

Parameters w
Association between feature, class label
How many parameters for K classes and N features?

  feature     class   weight
  action      NN      0.13
  action      VB      0.1
  action      JJ      0.21
  SUFF-tion   NN      1.3
  SUFF-tion   VB      2.1
  SUFF-tion   JJ      1.7
  SUFF-on     NN      0.01
  SUFF-on     VB      0.02

Training Classifier objects
interface Optimizer
  boolean optimize()
  Limited-memory BFGS, Conjugate gradient...

interface Optimizable
  interface ByValue
  interface ByValueGradient
  Specific objective functions

cc.mallet.optimize

Training Classifier objects
MaxEntOptimizableByLabelLikelihood (for Optimizable interface)
  double[] getParameters()
  void setParameters(double[] parameters)
  double getValue()
  void getValueGradient(double[] buffer)

Log likelihood and its first derivative

cc.mallet.classify
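
The division of labor between an objective (value and gradient) and an optimizer can be sketched as follows. Everything here is an assumption for illustration: the interface is hypothetical and only borrows the method names listed above, the objective is a toy concave quadratic rather than a log likelihood, and plain fixed-step gradient ascent stands in for L-BFGS or conjugate gradient.

```java
// Hypothetical sketch: an objective exposes value and gradient; an optimizer climbs it.
class GradientAscentSketch {

    // Method names follow the slide; this is not MALLET's actual interface.
    interface ByValueGradient {
        double[] getParameters();
        void setParameters(double[] parameters);
        double getValue();
        void getValueGradient(double[] buffer);
    }

    // Toy concave objective f(w) = -sum_i (w_i - target_i)^2, maximized at w = target.
    static class Quadratic implements ByValueGradient {
        private final double[] target;
        private double[] w;

        Quadratic(double[] target) {
            this.target = target.clone();
            this.w = new double[target.length];
        }

        public double[] getParameters() { return w.clone(); }

        public void setParameters(double[] parameters) { w = parameters.clone(); }

        public double getValue() {
            double value = 0.0;
            for (int i = 0; i < w.length; i++) {
                double d = w[i] - target[i];
                value -= d * d;
            }
            return value;
        }

        public void getValueGradient(double[] buffer) {
            for (int i = 0; i < w.length; i++) {
                buffer[i] = -2.0 * (w[i] - target[i]);
            }
        }
    }

    // Fixed-step gradient ascent; returns true if the gradient norm drops below tolerance.
    static boolean optimize(ByValueGradient f, double stepSize, int maxIterations, double tolerance) {
        double[] gradient = new double[f.getParameters().length];
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            f.getValueGradient(gradient);
            double squaredNorm = 0.0;
            for (double g : gradient) {
                squaredNorm += g * g;
            }
            if (Math.sqrt(squaredNorm) < tolerance) {
                return true;
            }
            double[] w = f.getParameters();
            for (int i = 0; i < w.length; i++) {
                w[i] += stepSize * gradient[i];
            }
            f.setParameters(w);
        }
        return false;
    }
}
```

Because the optimizer only sees parameters, value, and gradient, the same loop works for any objective that implements the interface.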

Evaluation of Classifiers
Create random test/train splits

InstanceList[] instanceLists = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

90% training, 10% testing, 0% validation

cc.mallet.types

Evaluation of Classifiers
The Trial class stores the results of classifications on an InstanceList (testing or training)

Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(...)
double getRecall(...)

cc.mallet.classify
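
For reference, the standard definitions behind accuracy, precision, recall, and F1 can be sketched over arrays of gold and predicted label indices. The class name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not necessarily what Trial does.

```java
// Hypothetical sketch of per-label evaluation metrics over integer label indices.
class EvaluationSketch {

    static double accuracy(int[] gold, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return gold.length == 0 ? 0.0 : (double) correct / gold.length;
    }

    // Of the instances predicted as label, what fraction truly has that label?
    static double precision(int[] gold, int[] predicted, int label) {
        int truePositives = 0, falsePositives = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                if (gold[i] == label) truePositives++; else falsePositives++;
            }
        }
        int denominator = truePositives + falsePositives;
        return denominator == 0 ? 0.0 : (double) truePositives / denominator;
    }

    // Of the instances that truly have label, what fraction was predicted as label?
    static double recall(int[] gold, int[] predicted, int label) {
        int truePositives = 0, falseNegatives = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                if (predicted[i] == label) truePositives++; else falseNegatives++;
            }
        }
        int denominator = truePositives + falseNegatives;
        return denominator == 0 ? 0.0 : (double) truePositives / denominator;
    }

    // Harmonic mean of precision and recall.
    static double f1(int[] gold, int[] predicted, int label) {
        double p = precision(gold, predicted, label);
        double r = recall(gold, predicted, label);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```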

Review
I have invented a new classifier: David regression.
What class should I implement to classify instances?
What class should I implement to train a David regression classifier?
I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
How would I check whether my new classifier works better than Naïve Bayes?

Outline
About MALLET, Representing Data, Classification, Sequence Tagging, Topic Modeling

Sequence Tagging
Data occurs in sequences
Categorical labels for each position
Labels are correlated

  DET  NN   VBS    VBG
  the  dog  likes  running

Sequence Tagging
Classification: n-way
Sequence Tagging: n^T-way

[Figure: one column of candidate labels (NN, JJ, PRP, VB, CC) for each word of "or red dogs on blue trees"]

Avoiding Exponential Blowup
Markov property
Dynamic programming

[Photo: Andrei Markov]

  DET  JJ  NN  VB
  This one, given this one, is independent of these.

[Figure: lattice of candidate labels (NN, JJ, PRP, VB, CC) over "red dogs on blue trees", processed one position at a time]
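
The combination of the Markov property and dynamic programming is what the Viterbi algorithm exploits: the best path to each label at position t depends only on the best paths to the labels at position t-1. Below is a minimal sketch, assuming scores are given as log-domain arrays; the class name and array layout are hypothetical, not MALLET's.

```java
// Hypothetical sketch of Viterbi decoding over K labels and T positions.
class ViterbiSketch {

    // logInit[k]: score of starting in label k.
    // logTrans[j][k]: score of moving from label j to label k.
    // logEmit[t][k]: score of label k at position t given the observation there.
    // Returns the highest-scoring label sequence in O(T * K^2) time instead of O(K^T).
    static int[] bestPath(double[] logInit, double[][] logTrans, double[][] logEmit) {
        int length = logEmit.length;
        int numLabels = logInit.length;
        if (length == 0) {
            return new int[0];
        }
        double[][] best = new double[length][numLabels];
        int[][] back = new int[length][numLabels];

        for (int k = 0; k < numLabels; k++) {
            best[0][k] = logInit[k] + logEmit[0][k];
        }
        for (int t = 1; t < length; t++) {
            for (int k = 0; k < numLabels; k++) {
                best[t][k] = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < numLabels; j++) {
                    double score = best[t - 1][j] + logTrans[j][k];
                    if (score > best[t][k]) {
                        best[t][k] = score;
                        back[t][k] = j;
                    }
                }
                best[t][k] += logEmit[t][k];
            }
        }

        // Pick the best final label, then follow back-pointers.
        int last = 0;
        for (int k = 1; k < numLabels; k++) {
            if (best[length - 1][k] > best[length - 1][last]) {
                last = k;
            }
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

With 5 labels and 5 words, exhaustive search scores 5^5 = 3125 sequences, while the table above needs only 5 * 5 * 5 = 125 transition evaluations.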

Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: fully generative
  P(Labels | Data) = P(Data, Labels) / P(Data)
Conditional Random Field: conditional
  P(Labels | Data)

Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: simple (independent) output space
  NSF-funded
  FeatureSequence: int[]
Conditional Random Field: arbitrarily complicated outputs
  NSF-funded CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d
  FeatureVectorSequence: FeatureVector[]

Importing Data
SimpleTagger format: one word per line, with instances delimited by a blank line

  Call VB
  me PPN
  Ishmael NNP
  . .

  Some JJ
  years NNS

Each line may also carry features between the word and the label:

  Call SUFF-ll VB
  me TWO_LETTERS PPN
  Ishmael BIBLICAL_NAME NNP
  . PUNCTUATION .

  Some CAPITALIZED JJ
  years TIME SUFF-s NNS

Importing Data
LineGroupIterator
SimpleTaggerSentence2TokenSequence()     // String to Tokens, handles labels
[Pipes that modify tokens]
TokenSequence2FeatureVectorSequence()    // Token objects to FeatureVectors

cc.mallet.pipe, cc.mallet.pipe.iterator

Importing Data
// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*"))      // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false)    // file: one name per line; boolean: ignore case?
// Ishmael C2=el CAP NAME

cc.mallet.pipe.tsf

Sliding window features
a red dog on a blue tree

Features for "dog":
  red@-1
  a@-2
  on@1
  a@-2_&_red@-1

Importing Data
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };        // previous position
conjunctions[1] = new int[] { 1 };         // next position
conjunctions[2] = new int[] { -2, -1 };    // previous two

OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1

TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1

cc.mallet.pipe.tsf
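
How offset arrays turn into feature names can be sketched in a few lines. This is a simplified illustration with hypothetical names: it conjoins the token text at each offset, whereas a pipe operating on tokens would conjoin whatever features those tokens already carry. Skipping a conjunction when any offset falls outside the sentence is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build sliding-window conjunction feature names for one position.
class OffsetConjunctionSketch {

    // Each inner array of conjunctions lists relative offsets to join with "_&_".
    static List<String> featuresAt(String[] tokens, int position, int[][] conjunctions) {
        List<String> features = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int p = position + offset;
                if (p < 0 || p >= tokens.length) {
                    inRange = false;   // window runs off the sentence: skip this conjunction
                    break;
                }
                if (name.length() > 0) {
                    name.append("_&_");
                }
                name.append(tokens[p]).append('@').append(offset);
            }
            if (inRange) {
                features.add(name.toString());
            }
        }
        return features;
    }
}
```

Called on "a red dog on a blue tree" at the position of "dog" with the three offset arrays above, this produces red@-1, on@1, and a@-2_&_red@-1.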

Finite State Transducers
Finite state machine over two alphabets (observed, hidden)

  Hidden:    DET  NN   VBS
  Observed:  the  dog

  P(DET)
  P(the | DET)
  P(NN | DET)
  P(dog | NN)
  P(VBS | NN)

How many parameters?
Determines efficiency of training
Too many leads to overfitting
Trick: Don't allow certain transitions
  P(VBS | DET) = 0

  DET  NN   VBS
  the  dog  runs

Finite State Transducers
abstract class Transducer
  CRF
  HMM
abstract class TransducerTrainer
  CRFTrainerByLabelLikelihood
  HMMTrainerByLikelihood

cc.mallet.fst

Finite State Transducers
  DET  NN   VBS
  the  dog  runs
First order: one weight for every pair of labels and observations.

CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
"Three-quarter" order: one weight for every pair of labels and observations.

crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
Second order: one weight for every triplet of labels and observations.

crf.addStatesForBiLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
"Half" order: equivalent to independent classifiers, except some transitions may be illegal.

crf.addStatesForHalfLabelsConnectedAsIn(instances);

cc.mallet.fst

Training a transducer
CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
trainer.train(trainingInstances);

cc.mallet.fst

Evaluating a transducer
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);
TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");
trainer.addEvaluator(evaluator);
trainer.train(trainingInstances);

cc.mallet.fst

Applying a transducer
Sequence output = transducer.transduce(input);
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}

cc.mallet.fst

Review
How do you add new features to TokenSequences?
What are three factors that affect the number of parameters in a model?

Outline
About MALLET, Representing Data, Classification, Sequence Tagging, Topic Modeling

Topics: "Semantic Groups"
News Article
Topics: Sports, Negotiation
Words: strike, team, player, deadline, union, game

Topics: "Semantic Groups"
Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams

players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining

Training a Topic Model
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

cc.mallet.topics

Evaluating a Topic Model
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
MarginalProbEstimator evaluator = lda.getProbEstimator();
double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);

cc.mallet.topics

Inferring topics for new documents
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
TopicInferencer inferencer = lda.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);

cc.mallet.topics

More than words
Text collections mix free text and structured data

David Mimno, Andrew McCallum. UAI 2008.
Topic models conditioned on arbitrary features using Dirichlet-multinomial regression.

Dirichlet-multinomial Regression (DMR)
The corpus specifies a vector of real-valued features (x) for each document, of length F.
Each topic has an F-length vector of parameters.

Topic parameters for feature "published in JMLR"
  2.27   kernel, kernels, rational kernels, string kernels, fisher kernel
  1.74   bounds, vc dimension, bound, upper bound, lower bounds
  1.41   reinforcement learning, learning, reinforcement
  1.40   blind source separation, source separation, separation, channel
  1.37   nearest neighbor, boosting, nearest neighbors, adaboost
  1.12   agent, agents, multi agent, autonomous agents
  1.21   strategies, strategy, adaptation, adaptive, driven
  1.23   retrieval, information retrieval, query, query expansion
  1.36   web, web pages, web page, world wide web, web sites
  1.44   user, users, user interface, interactive, interface

Feature parameters for RL topic
  2.99   Sridhar Mahadevan
  2.88   ICML
  2.56   Kenji Doya
  2.45   ECML
  2.19   Machine Learning Journal
  1.38   ACL
  1.47   CVPR
  1.54   IEEE Trans. PAMI
  1.64   COLING
  3.76   <default>

Topic parameters for feature "published in UAI"
  2.88   bayesian networks, bayesian network, belief networks
  2.26   qualitative, reasoning, qualitative reasoning, qualitative simulation
  2.25   probability, probabilities, probability distributions
  2.25   uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist
  2.11   reasoning, logic, default reasoning, nonmonotonic reasoning
  1.29   shape, deformable, shapes, contour, active contour
  1.36   digital libraries, digital library, digital, library
  1.37   workshop report, invited talk, international conference, report
  1.50   descriptions, description, top, bottom, top bottom
  1.50   nearest neighbor, boosting, nearest neighbors, adaboost

Feature parameters for Bayes nets topic
  2.88   UAI
  2.41   Mary-Anne Williams
  2.23   Ashraf M. Abdelbar
  2.15   Philippe Smets
  2.04   Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)
  1.16   Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)
  1.38   COLING
  1.50   Neural Networks
  2.24   ICRA
  3.36   <default>

Dirichlet-multinomial Regression
Arbitrary observed features of documents
Target contains FeatureVector

DMRTopicModel dmr = new DMRTopicModel(numTopics);
dmr.addInstances(training);
dmr.estimate();
dmr.writeParameters(new File("dmr.parameters"));

Polylingual Topic Modeling
Topics exist in more languages than you could possibly learn
Topically comparable documents are much easier to get than translation sets
Translation dictionaries
  cover pairs, not sets of languages
  miss technical vocabulary
  aren't available for low-resource languages

Topics from European Parliament Proceedings

Topics from Wikipedia

Aligned instance lists
  dog    cat   pig
  chien  chat
  hund         schwein

Polylingual Topics
InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };
PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);
pltm.addInstances(training);

MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallethandson.tar.gz
