[Figure: workflow. Selecting Keywords; Selecting Media; STEP 1C: Classification; STEP 2A: Collecting full content of news stories; STEP 2B: Collecting user-generated comments]
[Figure: article sets, shown in two versions]
U: All irrelevant articles (labelled "All articles" in the second version)
U*: a subset of U; n = |R|
M: Articles w/ keywords in full text (n = 3,027)
R: Articles w/ keywords in titles (n = 2,271)
Time axis labels: 2014-04, 2015-09
[Figure: constructing the test set, shown in two versions]
U: No keyword (N = 5.3M)
R: Keyword in titles
U*: A sample of U
Test Set
The second version adds R*: a random sample of R, and labels U* as a random sample of U.
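The sampling labels in these diagrams (R*: a random sample of R; U*: a random sample of U; and, in the earlier diagram, U*: a subset of U with n = |R|) suggest a size-matched test set. A minimal sketch of that idea, assuming article IDs are held in plain Python lists. The function name `build_balanced_test_set`, the seed, the 1/0 labels, and the toy IDs are all hypothetical and do not come from the source; the actual sampling procedure may differ.

```python
import random


def build_balanced_test_set(relevant_ids, unlabeled_ids, n=None, seed=0):
    """Draw R* from R and a size-matched U* from U (hypothetical helper)."""
    rng = random.Random(seed)
    # Default to all of R, so that |U*| = |R| as in the diagram label.
    if n is None:
        n = len(relevant_ids)
    if n > len(relevant_ids) or n > len(unlabeled_ids):
        raise ValueError("sample size exceeds the size of R or U")
    r_star = rng.sample(relevant_ids, n)   # R*: random sample of R
    u_star = rng.sample(unlabeled_ids, n)  # U*: random sample of U
    # Assumed labels: 1 = keyword in title, 0 = no keyword.
    return [(a, 1) for a in r_star] + [(a, 0) for a in u_star]


# Toy IDs for illustration only (not the study's data).
R = [f"r{i}" for i in range(5)]
U = [f"u{i}" for i in range(100)]
test_set = build_balanced_test_set(R, U)
```

Sampling without replacement and fixing the seed keeps the two halves equal in size and the draw reproducible.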
[Figure: classifier performance; metric names were not preserved]
Unlabeled values: .90, 0.8401487, 0.9911894, 0.9098196, 0.9, 50%
Estimates with bracketed intervals:
.991 [.982, 1.000]
.909 [.889, .930]
.901 [.877, .925]
.840 [.806, .875]
[Figure: consistency of classification. Annotation: Optimal cutoff point?]
Always classified as relevant (57.0%)
Always classified as irrelevant (26.8%)
Mixed (16.2%)
[Figure: workflow, continued. STEP 1C: Classification; Evaluation of Performance by Human Coders; STEP 2A: Collecting full content of news stories; STEP 2B: Collecting user-generated comments]
STEP 3
Pre-processing
and