
Text Classification

SNLP 2016
CSE, IIT Kharagpur

November 7th, 2016


Example: Positive or negative movie review?


Example: Male or Female Author?


Example: What is the subject of this article?


Text Classification

Assigning subject categories, topics, or genres


Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
...



Text classification: problem definition

Input
A document d
A fixed set of classes C = {c₁, c₂, ..., cₙ}

Output
A predicted class c ∈ C



Classification Methods: Hand-coded rules

Rules based on combinations of words or other features

Spam
black-list-address OR (dollars AND have been selected)

Pros and Cons
Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.

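A minimal sketch of such a hand-coded rule in Python (the message representation, field names, and addresses here are hypothetical assumptions, not from the slides):

    def is_spam(message, blacklist):
        # Hand-coded rule from the slide:
        # black-list-address OR (dollars AND "have been selected").
        # `message` is a hypothetical dict with "sender" and "body" keys.
        body = message["body"].lower()
        return (message["sender"] in blacklist
                or ("dollars" in body and "have been selected" in body))

    blacklist = {"winner@lottery.example"}
    msg = {"sender": "friend@mail.example",
           "body": "You have been selected to receive one million dollars!"}
    print(is_spam(msg, blacklist))  # True (matches the second clause)

Refining conjunctions and word lists like these by hand is exactly the maintenance burden noted above.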

Classification Methods: Supervised Machine Learning

Naïve Bayes
Logistic regression
Support-vector machines
...


Naïve Bayes Intuition

Simple classification method based on Bayes rule
Relies on a very simple representation of the document: bag of words


Bag of words for document classification

[Figure: a document reduced to a bag of word counts]


Bayes rule for documents and classes

For a document d and a class c:

    P(c|d) = \frac{P(d|c)\, P(c)}{P(d)}

Naïve Bayes Classifier

    c_{MAP} = \arg\max_{c \in C} P(c|d)
            = \arg\max_{c \in C} P(d|c)\, P(c)
            = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)



Naïve Bayes classification assumptions

How do we estimate P(x_1, x_2, \ldots, x_n \mid c)?

Bag of words assumption
Assume that the position of a word in the document doesn't matter.

Conditional Independence
Assume the feature probabilities P(x_i \mid c_j) are independent given the class c_j:

    P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)

Together these give the Naïve Bayes classifier:

    c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)

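As a minimal sketch of this decision rule (the data structures and names are assumptions, not from the slides), the argmax can be computed in log space; summing log probabilities is equivalent to the product above since log is monotonic, and it avoids floating-point underflow on long documents:

    import math

    def classify_nb(doc_words, classes, log_prior, log_likelihood):
        # Returns argmax over c of log P(c) + sum_i log P(x_i | c).
        # log_prior[c] = log P(c); log_likelihood[(w, c)] = log P(w | c).
        best_class, best_score = None, float("-inf")
        for c in classes:
            score = log_prior[c] + sum(log_likelihood[(w, c)]
                                       for w in doc_words
                                       if (w, c) in log_likelihood)  # skip unseen words
            if score > best_score:
                best_class, best_score = c, score
        return best_class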


Learning the model parameters

Maximum Likelihood Estimate

    \hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}

    \hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}

Problem with MLE
Suppose in the training data we haven't seen the word "fantastic" in any document classified as positive. Then

    \hat{P}(\text{fantastic} \mid \text{positive}) = 0

and this single zero wipes out the whole product in

    c_{NB} = \arg\max_{c} \hat{P}(c) \prod_{x_i \in X} \hat{P}(x_i \mid c)


Laplace (add-1) smoothing

    \hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} (\mathrm{count}(w, c) + 1)}
                        = \frac{\mathrm{count}(w_i, c) + 1}{\left( \sum_{w \in V} \mathrm{count}(w, c) \right) + |V|}

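A minimal training sketch implementing the MLE prior and the add-1 smoothed likelihood above (function and variable names are illustrative assumptions); it builds the log-probability tables consumed by the classify_nb sketch earlier:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels):
        # docs: list of token lists; labels: parallel list of class labels.
        n_docs = len(docs)
        classes = set(labels)
        vocab = {w for doc in docs for w in doc}
        doc_count = Counter(labels)            # doccount(C = c)
        word_count = defaultdict(Counter)      # word_count[c][w] = count(w, c)
        for doc, c in zip(docs, labels):
            word_count[c].update(doc)

        log_prior = {c: math.log(doc_count[c] / n_docs) for c in classes}
        log_likelihood = {}
        for c in classes:
            total = sum(word_count[c].values())    # sum over w of count(w, c)
            for w in vocab:
                # Add-1 smoothing: (count(w, c) + 1) / (total + |V|)
                log_likelihood[(w, c)] = math.log((word_count[c][w] + 1)
                                                  / (total + len(vocab)))
        return log_prior, log_likelihood, classes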

A worked example

[Figures: step-by-step worked example of training and applying a Naïve Bayes classifier]


Naïve Bayes and Language Modeling

In general, an NB classifier can use any features: URLs, email addresses, dictionaries, network features.
But if we use only the word features, and all the words in the text, Naïve Bayes has an important similarity to language modeling:
Each class can be thought of as a separate unigram language model.


Naïve Bayes as Language Modeling

Which class assigns a higher probability to the sentence?

[Figure: an example sentence scored word-by-word under two class unigram models]
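A sketch of this unigram-language-model view; the sentence and the per-word probabilities are illustrative made-up values, not the figure's actual numbers:

    import math

    # Hypothetical class-conditional unigram models (illustrative values only).
    unigram = {
        "pos": {"i": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1},
        "neg": {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
    }

    def sentence_log_prob(words, model):
        # log P(sentence | class) under a unigram language model.
        return sum(math.log(model[w]) for w in words)

    sentence = ["i", "love", "this", "fun", "film"]
    scores = {c: sentence_log_prob(sentence, m) for c, m in unigram.items()}
    print(max(scores, key=scores.get))  # the class assigning higher probability: pos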


Naïve Bayes: More than Two Classes

Multi-value classification
A document can belong to 0, 1, or more than 1 classes.

Handling multi-value classification
For each class c ∈ C, build a classifier γ_c to distinguish c from all other classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
d belongs to any class for which γ_c returns true.



Naïve Bayes: More than Two Classes

One-of or multinomial classification
Classes are mutually exclusive: each document belongs to exactly one class.

Binary classifiers may also be used
For each class c ∈ C, build a classifier γ_c to distinguish c from all other classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
d belongs to the one class with the maximum score.

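A sketch of the two decision policies over per-class binary classifiers (the scorers mapping from class to scoring function is a hypothetical interface, not defined in the slides):

    def any_of(doc, scorers, threshold=0.0):
        # Multi-value: d belongs to every class whose binary classifier fires.
        return {c for c, score_fn in scorers.items() if score_fn(doc) > threshold}

    def one_of(doc, scorers):
        # One-of: d belongs to the single class with the maximum score.
        return max(scorers, key=lambda c: scorers[c](doc))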

Evaluation: Constructing the Confusion Matrix

For each pair of classes ⟨c₁, c₂⟩: how many documents from c₁ were incorrectly assigned to c₂ (when c₂ ≠ c₁)?



Per-class evaluation measures

Recall
Fraction of docs in class i classified correctly:

    \frac{c_{ii}}{\sum_j c_{ij}}

Precision
Fraction of docs assigned class i that are actually about class i:

    \frac{c_{ii}}{\sum_j c_{ji}}

Accuracy
Fraction of docs classified correctly:

    \frac{\sum_i c_{ii}}{N}

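All three measures can be read directly off the confusion matrix; a minimal sketch, assuming the matrix is stored as nested dicts with confusion[i][j] = number of docs of true class i predicted as class j:

    def per_class_measures(confusion, classes):
        n = sum(confusion[i][j] for i in classes for j in classes)
        accuracy = sum(confusion[c][c] for c in classes) / n
        measures = {}
        for c in classes:
            row = sum(confusion[c][j] for j in classes)  # sum_j c_ij: docs truly in c
            col = sum(confusion[i][c] for i in classes)  # sum_j c_ji: docs assigned c
            measures[c] = {"recall": confusion[c][c] / row if row else 0.0,
                           "precision": confusion[c][c] / col if col else 0.0}
        return measures, accuracy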


Micro- vs. Macro-Average

If we have more than one class, how do we combine multiple performance measures into one quantity?

Macro-averaging
Compute performance for each class, then average.

Micro-averaging
Collect decisions for all classes into a single contingency table, then evaluate it.



Micro- vs. Macro-Average

Macro-averaged precision: (0.5 + 0.9)/2 = 0.7
Micro-averaged precision: 100/120 ≈ 0.83
The micro-averaged score is dominated by the score on the common classes.

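A sketch reproducing these numbers; the per-class true/false-positive counts are back-derived assumptions consistent with the figures above (precisions 0.5 and 0.9, pooling to 100/120), not values taken from the slide itself:

    # Assumed per-class counts consistent with the figures above.
    per_class = {"c1": {"tp": 10, "fp": 10},   # precision 10/20 = 0.5
                 "c2": {"tp": 90, "fp": 10}}   # precision 90/100 = 0.9

    macro = sum(d["tp"] / (d["tp"] + d["fp"])
                for d in per_class.values()) / len(per_class)
    pooled_tp = sum(d["tp"] for d in per_class.values())
    pooled_fp = sum(d["fp"] for d in per_class.values())
    micro = pooled_tp / (pooled_tp + pooled_fp)

    print(macro)  # 0.7
    print(micro)  # 0.8333... (100/120)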

Sample Problem
In a text classification task, suppose you want to categorize documents into 3 classes: politics, sports, and nature. The table below shows 16 such documents, their true labels, and the labels assigned by your classifier.
Doc.  Actual    Predicted     Doc.  Actual    Predicted
1     politics  sports        2     politics  politics
3     sports    sports        4     nature    politics
5     politics  politics      6     nature    nature
7     sports    sports        8     sports    nature
9     politics  politics      10    politics  politics
11    politics  nature        12    politics  sports
13    sports    politics      14    nature    politics
15    nature    nature        16    politics  nature

Construct the confusion matrix and find the macro-averaged and micro-averaged precision
values. Which category has the highest recall?
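One way to check your answer is to tally the (actual, predicted) pairs from the table; a sketch, reusing the nested-dict matrix layout from the per_class_measures sketch above:

    from collections import Counter

    pairs = [  # (actual, predicted) for documents 1-16, read off the table
        ("politics", "sports"),   ("politics", "politics"),
        ("sports", "sports"),     ("nature", "politics"),
        ("politics", "politics"), ("nature", "nature"),
        ("sports", "sports"),     ("sports", "nature"),
        ("politics", "politics"), ("politics", "politics"),
        ("politics", "nature"),   ("politics", "sports"),
        ("sports", "politics"),   ("nature", "politics"),
        ("nature", "nature"),     ("politics", "nature"),
    ]
    counts = Counter(pairs)
    classes = ("politics", "sports", "nature")
    confusion = {a: {p: counts[(a, p)] for p in classes} for a in classes}
    for a in classes:
        print(a, [confusion[a][p] for p in classes])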
