
Text Classification

SNLP 2016
CSE, IIT Kharagpur

November 7th, 2016


Example: Positive or negative movie review?


Example: Male or Female Author?


Example: What is the subject of this article?


Text Classification

Assigning subject categories, topics, or genres


Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
...



Text classification: problem definition

Input
A document d
A fixed set of classes C = {c₁, c₂, ..., cₙ}

Output
A predicted class c ∈ C



Classification Methods: Hand-coded rules

Rules based on combinations of words or other features

Spam
black-list-address OR (dollars AND have been selected)

Pros and Cons
Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.

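A minimal sketch of such a hand-coded rule in Python (the message representation, field names, and addresses here are hypothetical assumptions, not from the slides):

    def is_spam(message, blacklist):
        # Hand-coded rule from the slide:
        # black-list-address OR (dollars AND "have been selected").
        # `message` is a hypothetical dict with "sender" and "body" keys.
        body = message["body"].lower()
        return (message["sender"] in blacklist
                or ("dollars" in body and "have been selected" in body))

    blacklist = {"winner@lottery.example"}
    msg = {"sender": "friend@mail.example",
           "body": "You have been selected to receive one million dollars!"}
    print(is_spam(msg, blacklist))  # True (matches the second clause)

Refining conjunctions and word lists like these by hand is exactly the maintenance burden noted above.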

Classification Methods: Supervised Machine Learning

Naïve Bayes
Logistic regression
Support-vector machines
...


Naïve Bayes Intuition

Simple classification method based on Bayes rule
Relies on a very simple representation of the document: bag of words


Bag of words for document classification

[Figure: a document reduced to a bag of word counts]


Bayes rule for documents and classes

For a document d and a class c:

    P(c|d) = \frac{P(d|c)\, P(c)}{P(d)}

Naïve Bayes Classifier

    c_{MAP} = \arg\max_{c \in C} P(c|d)
            = \arg\max_{c \in C} P(d|c)\, P(c)
            = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)



Naïve Bayes classification assumptions

How do we estimate P(x_1, x_2, \ldots, x_n \mid c)?

Bag of words assumption
Assume that the position of a word in the document doesn't matter.

Conditional Independence
Assume the feature probabilities P(x_i \mid c_j) are independent given the class c_j:

    P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)

Together these give the Naïve Bayes classifier:

    c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)

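As a minimal sketch of this decision rule (the data structures and names are assumptions, not from the slides), the argmax can be computed in log space; summing log probabilities is equivalent to the product above since log is monotonic, and it avoids floating-point underflow on long documents:

    import math

    def classify_nb(doc_words, classes, log_prior, log_likelihood):
        # Returns argmax over c of log P(c) + sum_i log P(x_i | c).
        # log_prior[c] = log P(c); log_likelihood[(w, c)] = log P(w | c).
        best_class, best_score = None, float("-inf")
        for c in classes:
            score = log_prior[c] + sum(log_likelihood[(w, c)]
                                       for w in doc_words
                                       if (w, c) in log_likelihood)  # skip unseen words
            if score > best_score:
                best_class, best_score = c, score
        return best_class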


Learning the model parameters

Maximum Likelihood Estimate

    \hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}

    \hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}

Problem with MLE
Suppose in the training data we haven't seen the word "fantastic" in any document classified as positive. Then

    \hat{P}(\text{fantastic} \mid \text{positive}) = 0

and this single zero wipes out the whole product in

    c_{NB} = \arg\max_{c} \hat{P}(c) \prod_{x_i \in X} \hat{P}(x_i \mid c)


Laplace (add-1) smoothing

    \hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} (\mathrm{count}(w, c) + 1)}
                        = \frac{\mathrm{count}(w_i, c) + 1}{\left( \sum_{w \in V} \mathrm{count}(w, c) \right) + |V|}

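A minimal training sketch implementing the MLE prior and the add-1 smoothed likelihood above (function and variable names are illustrative assumptions); it builds the log-probability tables consumed by the classify_nb sketch earlier:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels):
        # docs: list of token lists; labels: parallel list of class labels.
        n_docs = len(docs)
        classes = set(labels)
        vocab = {w for doc in docs for w in doc}
        doc_count = Counter(labels)            # doccount(C = c)
        word_count = defaultdict(Counter)      # word_count[c][w] = count(w, c)
        for doc, c in zip(docs, labels):
            word_count[c].update(doc)

        log_prior = {c: math.log(doc_count[c] / n_docs) for c in classes}
        log_likelihood = {}
        for c in classes:
            total = sum(word_count[c].values())    # sum over w of count(w, c)
            for w in vocab:
                # Add-1 smoothing: (count(w, c) + 1) / (total + |V|)
                log_likelihood[(w, c)] = math.log((word_count[c][w] + 1)
                                                  / (total + len(vocab)))
        return log_prior, log_likelihood, classes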

A worked example

[Figures: step-by-step worked example of training and applying a Naïve Bayes classifier]


Naïve Bayes and Language Modeling

In general, an NB classifier can use any features: URLs, email addresses, dictionaries, network features.
But if we use only the word features, and all the words in the text, Naïve Bayes has an important similarity to language modeling:
Each class can be thought of as a separate unigram language model.


Naïve Bayes as Language Modeling

Which class assigns a higher probability to the sentence?

[Figure: an example sentence scored word-by-word under two class unigram models]
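A sketch of this unigram-language-model view; the sentence and the per-word probabilities are illustrative made-up values, not the figure's actual numbers:

    import math

    # Hypothetical class-conditional unigram models (illustrative values only).
    unigram = {
        "pos": {"i": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1},
        "neg": {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
    }

    def sentence_log_prob(words, model):
        # log P(sentence | class) under a unigram language model.
        return sum(math.log(model[w]) for w in words)

    sentence = ["i", "love", "this", "fun", "film"]
    scores = {c: sentence_log_prob(sentence, m) for c, m in unigram.items()}
    print(max(scores, key=scores.get))  # the class assigning higher probability: pos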


Naïve Bayes: More than Two Classes

Multi-value classification
A document can belong to 0, 1, or more than 1 classes.

Handling multi-value classification
For each class c ∈ C, build a classifier γ_c to distinguish c from all other classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
d belongs to any class for which γ_c returns true.



Naïve Bayes: More than Two Classes

One-of or multinomial classification
Classes are mutually exclusive: each document belongs to exactly one class.

Binary classifiers may also be used
For each class c ∈ C, build a classifier γ_c to distinguish c from all other classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
d belongs to the one class with the maximum score.

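A sketch of the two decision policies over per-class binary classifiers (the scorers mapping from class to scoring function is a hypothetical interface, not defined in the slides):

    def any_of(doc, scorers, threshold=0.0):
        # Multi-value: d belongs to every class whose binary classifier fires.
        return {c for c, score_fn in scorers.items() if score_fn(doc) > threshold}

    def one_of(doc, scorers):
        # One-of: d belongs to the single class with the maximum score.
        return max(scorers, key=lambda c: scorers[c](doc))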

Evaluation: Constructing the Confusion Matrix

For each pair of classes ⟨c₁, c₂⟩: how many documents from c₁ were incorrectly assigned to c₂ (when c₂ ≠ c₁)?



Per-class evaluation measures

Recall
Fraction of docs in class i classified correctly:

    \frac{c_{ii}}{\sum_j c_{ij}}

Precision
Fraction of docs assigned class i that are actually about class i:

    \frac{c_{ii}}{\sum_j c_{ji}}

Accuracy
Fraction of docs classified correctly:

    \frac{\sum_i c_{ii}}{N}

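All three measures can be read directly off the confusion matrix; a minimal sketch, assuming the matrix is stored as nested dicts with confusion[i][j] = number of docs of true class i predicted as class j:

    def per_class_measures(confusion, classes):
        n = sum(confusion[i][j] for i in classes for j in classes)
        accuracy = sum(confusion[c][c] for c in classes) / n
        measures = {}
        for c in classes:
            row = sum(confusion[c][j] for j in classes)  # sum_j c_ij: docs truly in c
            col = sum(confusion[i][c] for i in classes)  # sum_j c_ji: docs assigned c
            measures[c] = {"recall": confusion[c][c] / row if row else 0.0,
                           "precision": confusion[c][c] / col if col else 0.0}
        return measures, accuracy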


Micro- vs. Macro-Average

If we have more than one class, how do we combine multiple performance measures into one quantity?

Macro-averaging
Compute performance for each class, then average.

Micro-averaging
Collect decisions for all classes into a single contingency table, then evaluate it.



Micro- vs. Macro-Average

Macro-averaged precision: (0.5 + 0.9)/2 = 0.7
Micro-averaged precision: 100/120 ≈ 0.83
The micro-averaged score is dominated by the score on the common classes.

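A sketch reproducing these numbers; the per-class true/false-positive counts are back-derived assumptions consistent with the figures above (precisions 0.5 and 0.9, pooling to 100/120), not values taken from the slide itself:

    # Assumed per-class counts consistent with the figures above.
    per_class = {"c1": {"tp": 10, "fp": 10},   # precision 10/20 = 0.5
                 "c2": {"tp": 90, "fp": 10}}   # precision 90/100 = 0.9

    macro = sum(d["tp"] / (d["tp"] + d["fp"])
                for d in per_class.values()) / len(per_class)
    pooled_tp = sum(d["tp"] for d in per_class.values())
    pooled_fp = sum(d["fp"] for d in per_class.values())
    micro = pooled_tp / (pooled_tp + pooled_fp)

    print(macro)  # 0.7
    print(micro)  # 0.8333... (100/120)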

Sample Problem
In a text classification task, suppose you want to categorize documents into 3 classes: politics, sports, and nature. The table below shows 16 such documents, their true labels, and the labels assigned by your classifier.
Doc.  Actual    Predicted     Doc.  Actual    Predicted
1     politics  sports        2     politics  politics
3     sports    sports        4     nature    politics
5     politics  politics      6     nature    nature
7     sports    sports        8     sports    nature
9     politics  politics      10    politics  politics
11    politics  nature        12    politics  sports
13    sports    politics      14    nature    politics
15    nature    nature        16    politics  nature

Construct the confusion matrix and find the macro-averaged and micro-averaged precision
values. Which category has the highest recall?
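One way to check your answer is to tally the (actual, predicted) pairs from the table; a sketch, reusing the nested-dict matrix layout from the per_class_measures sketch above:

    from collections import Counter

    pairs = [  # (actual, predicted) for documents 1-16, read off the table
        ("politics", "sports"),   ("politics", "politics"),
        ("sports", "sports"),     ("nature", "politics"),
        ("politics", "politics"), ("nature", "nature"),
        ("sports", "sports"),     ("sports", "nature"),
        ("politics", "politics"), ("politics", "politics"),
        ("politics", "nature"),   ("politics", "sports"),
        ("sports", "politics"),   ("nature", "politics"),
        ("nature", "nature"),     ("politics", "nature"),
    ]
    counts = Counter(pairs)
    classes = ("politics", "sports", "nature")
    confusion = {a: {p: counts[(a, p)] for p in classes} for a in classes}
    for a in classes:
        print(a, [confusion[a][p] for p in classes])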
