School documents
Professional documents
Culture documents
SNLP 2016
CSE, IIT Kharagpur
Text Classification
Input
A document d
A fixed set of classes C = {c1, c2, ..., cn}

Output
A predicted class c ∈ C
Spam
black-list-address OR (dollars AND have been selected)
Naïve Bayes
Logistic regression
Support-vector machines
...
Bayes' rule gives the posterior probability of a class c given a document d:

P(c|d) = P(d|c) P(c) / P(d)
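As a quick numeric illustration of Bayes' rule for a two-class (spam/ham) decision; all the probabilities below are made-up numbers for the sketch:

```python
# Hypothetical class-conditional likelihoods and priors for a spam filter
p_d_given_spam, p_spam = 0.20, 0.30
p_d_given_ham,  p_ham  = 0.05, 0.70

# P(d) by the law of total probability
p_d = p_d_given_spam * p_spam + p_d_given_ham * p_ham

# Bayes' rule: P(spam|d) = P(d|spam) P(spam) / P(d)
p_spam_given_d = p_d_given_spam * p_spam / p_d
print(round(p_spam_given_d, 3))  # ~0.632: the posterior flips toward spam
```

Note that for classification the denominator P(d) is the same for every class, which is why it can be dropped from the argmax.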
Conditional Independence
Assume the feature probabilities P(xi|cj) are independent given the class cj:

P(x1, x2, ..., xn | cj) = P(x1|cj) · P(x2|cj) · ... · P(xn|cj)
The maximum likelihood estimate of the word likelihoods is

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

Problem: if a word never occurs with a class in the training data, e.g. P̂(fantastic|positive) = 0, then

c_NB = argmax_c P̂(c) Π_{x∈X} P̂(x|c)

assigns zero probability to that class for any document containing the word, no matter what the other words say.
Laplace (add-1) smoothing:

P̂(wi|c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
         = (count(wi, c) + 1) / ((Σ_{w∈V} count(w, c)) + |V|)
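The pieces above assemble into a small multinomial Naïve Bayes classifier. This is a minimal sketch (the function names and toy corpus are invented for illustration), with add-1 smoothing applied exactly as in the formula:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, class) pairs.
    Returns log P(c), per-class word counts, and the vocabulary V."""
    n_docs = len(docs)
    class_counts = Counter(c for _, c in docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    return log_prior, word_counts, vocab

def predict_nb(words, log_prior, word_counts, vocab):
    """argmax_c of log P(c) + sum_i log P_hat(w_i|c), add-1 smoothed."""
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        total = sum(word_counts[c].values())   # Σ_{w∈V} count(w, c)
        score = log_prior[c]
        for w in words:
            if w not in vocab:                 # ignore out-of-vocabulary words
                continue
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With a toy corpus, a word seen only with one class no longer zeroes out the other class: every count is shifted up by 1, so all log terms stay finite.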
A worked example
But if we use only the word features, and all the words in the text, Naïve Bayes has an important similarity to language modeling: each class can be thought of as a separate unigram language model.
Multi-value classification
A document can belong to 0, 1, or more than one class.
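One common way to allow 0, 1, or many labels is to make an independent yes/no decision per class. The sketch below uses hypothetical keyword-trigger rules in place of trained per-class classifiers, purely to show the one-decision-per-class structure:

```python
def multilabel_classify(doc_words, triggers_by_class):
    """Independent binary decision per class: a document receives every
    class whose trigger set it matches - possibly none, possibly several."""
    words = set(doc_words)
    return {c for c, triggers in triggers_by_class.items() if words & triggers}

# Invented rules for two classes
rules = {"spam": {"dollars", "selected"}, "sports": {"match", "goal"}}
print(multilabel_classify("you have been selected to win dollars".split(), rules))
# -> {'spam'}
```

In practice each binary decision would come from a trained classifier (e.g. one Naïve Bayes model per class), but the set-valued output is the same.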
For each pair of classes ⟨c1, c2⟩, how many documents from c1 were incorrectly assigned to c2 (when c2 ≠ c1)?
Recall
Fraction of docs in class i classified correctly:

Recall_i = c_ii / Σ_j c_ij

Precision
Fraction of docs assigned class i that are actually about class i:

Precision_i = c_ii / Σ_j c_ji

Accuracy
Fraction of all docs classified correctly:

Accuracy = (Σ_i c_ii) / (Σ_i Σ_j c_ij)
Macro-averaging
Compute performance for each class, then average
Micro-averaging
Collect decisions for all the classes, compute contingency table, evaluate.
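The difference between the two averages can be made concrete with a small confusion matrix (the numbers below are invented; cm[i][j] counts docs of true class i assigned class j):

```python
def per_class_precision(cm):
    """Precision of class j: cm[j][j] / Σ_i cm[i][j] (its column sum)."""
    n = len(cm)
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    return [cm[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]

def macro_precision(cm):
    """Average the per-class precisions: every class counts equally."""
    p = per_class_precision(cm)
    return sum(p) / len(p)

def micro_precision(cm):
    """Pool all decisions first: large classes dominate."""
    n = len(cm)
    correct = sum(cm[i][i] for i in range(n))
    total = sum(sum(row) for row in cm)
    return correct / total

cm = [[10, 10],   # rare class: 10 right, 10 sent to the common class
      [10, 970]]  # common class: almost all right
print(macro_precision(cm), micro_precision(cm))
```

Here micro-averaging is dominated by the common class (0.98), while macro-averaging exposes the weak rare class (≈0.74): the choice of average changes which failures are visible.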
Macro-averaged precision: (worked-example tables not recoverable from the slide extraction)
Sample Problem
In a text-classification task, assume that you want to categorize various documents into 3 classes: politics, sports, and nature. The table below shows 16 such documents, their true labels, and the labels assigned by your classifier.
Doc. | Actual   | Predicted
  1  | politics | sports
  2  | politics | politics
  3  | sports   | sports
  4  | nature   | politics
  5  | politics | politics
  6  | nature   | nature
  7  | sports   | sports
  8  | sports   | nature
  9  | politics | politics
 10  | politics | politics
 11  | politics | nature
 12  | politics | sports
 13  | sports   | politics
 14  | nature   | politics
 15  | nature   | nature
 16  | politics | nature
Construct the confusion matrix and find the macro-averaged and micro-averaged precision
values. Which category has the highest recall?
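A sketch of how the problem above can be checked programmatically; the (actual, predicted) pairs are transcribed from the table, and the precision/recall definitions are the per-class formulas from earlier in the lecture:

```python
from collections import Counter

# (actual, predicted) label pairs for documents 1..16, from the table above
pairs = [
    ("politics", "sports"),   ("politics", "politics"),
    ("sports",   "sports"),   ("nature",   "politics"),
    ("politics", "politics"), ("nature",   "nature"),
    ("sports",   "sports"),   ("sports",   "nature"),
    ("politics", "politics"), ("politics", "politics"),
    ("politics", "nature"),   ("politics", "sports"),
    ("sports",   "politics"), ("nature",   "politics"),
    ("nature",   "nature"),   ("politics", "nature"),
]
classes = ["politics", "sports", "nature"]
cm = Counter(pairs)  # cm[(actual, predicted)] = one cell of the confusion matrix

precision = {c: cm[(c, c)] / sum(cm[(a, c)] for a in classes) for c in classes}
recall    = {c: cm[(c, c)] / sum(cm[(c, p)] for p in classes) for c in classes}
macro_p   = sum(precision.values()) / len(classes)
micro_p   = sum(cm[(c, c)] for c in classes) / sum(cm.values())
print(precision, recall, macro_p, micro_p)
```

Run as written, this gives macro-averaged precision (4/7 + 1/2 + 2/5)/3 ≈ 0.49 and micro-averaged precision 8/16 = 0.5; perhaps surprisingly, all three classes tie on recall at 0.5, so no single category has the highest recall.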