Retterer-Moore, Qinghao Wu
Project Report
Problem Statement
Our goal in this project was to build a classifier that can determine what genre of news a given article or forum post relates to. Ideally, such a classifier would be able to identify most of the categories widely used in the world of news, although choosing which set of categories to use is hard, since different news sources have slightly different methods of categorizing news: one source might use "science" while another has "technology", and the two would contain subtly different sets of articles that are hard to distinguish.
We ended up choosing one set of categories based on a good source of training/test data we found, although many alternate categorizations could be used; it would be interesting to see whether our method in fact maintained all of its effectiveness on those categorizations.
This is a useful problem to solve, since it has many applications. For example, a search engine may want to aggregate news from many sources on a specific topic, like business news; it would then want to scan a wide variety of sources for their content rather than just relying on the classifications the sources themselves use, since, as addressed above, different sources may classify news in many subtly distinct ways. It could also be used to analyze trends in reporting: for example, do articles about politics tend to use more words related to emotion than articles about science?
Now that our problem is clearly specified, let's describe the training and test data we used and how we built our classifier to accurately categorize news.
Data
We got our data from http://qwone.com/~jason/20Newsgroups/, a collection of about 20,000 documents from newsgroup forums in the late 1990s. The documents were split into 20 categories, but we combined some similar categories and left out a few to get 5 broad categories similar to ones that a news website might use: automotive news, political news, sports news, computer news, and religion news.
The data we used consisted of approximately 1000 training documents and 1000 test documents for each category, all forum posts from various newsgroups related to the category. It had been at least partially filtered for filler words, so many common filler words like "a" and "the" did not appear in it, although some still remained and had to be dealt with by our feature selection methods.
Overall, this data gave us a large pool of documents to train our classifier with, and the documents also had a fairly high concentration of useful words (e.g., the word "team" appeared many times in the sports documents), so it was a good set of data to build our classifier on.
Method
We start by figuring out how many times each word appears in each set of training data, as word frequency is the most basic metric to consider when classifying various types of documents.
We filter out any words that appear fewer than 50 times in the 1000 documents, to reduce the number of words we need to consider; it's unlikely that words appearing that rarely will show up in the documents we want to classify later.
That also helps remove uncommon filler words like "although" or "between" that appear infrequently across all categories of news.
We then store the word-frequency pairs separately for each category of document.
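A minimal sketch of this counting and filtering step, assuming the documents for one category are available as strings (the function name and the toy documents below are our own; the real pipeline used the 50-occurrence threshold over roughly 1000 documents per category):

```python
from collections import Counter

MIN_COUNT = 50  # threshold from the text: drop words seen fewer than 50 times

def count_words(documents, min_count=MIN_COUNT):
    """Count word occurrences across one category's training documents,
    keeping only words that reach the frequency threshold."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy run with a lower threshold so something survives the filter.
docs = ["the team won the game", "the team lost"]
print(count_words(docs, min_count=2))  # → {'the': 3, 'team': 2}
```

The surviving word-count dictionaries would then be stored per category, as described above.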
A future direction that would certainly improve our results is a simple system for parsing a word down to its root. For example, if "run", "runs", "runner", "runners", and "ran" each had 45 occurrences, we wouldn't consider any of them, when really the root word "run" had 225 occurrences and we really should consider it.
Unfortunately, we couldn't find any efficient, easy-to-implement methods for the word-parsing issue, so implementing it remains a future direction for our project rather than a current one.
For each word and each category, we assign the word a weight for that category representing how strongly correlated that word's appearance is with the article belonging to that category.
We considered a few variants on TF-IDF to calculate the weight of each word. One would be the frequency of the word in that specific category divided by the frequency of the word across all categories, to measure how closely the word is related to that specific category.
We initially decided against this because it treated all words exclusive to a category with the same weight: if "soccer" appeared only in sports and appeared many times, and "turnover" appeared only in sports but appeared only once, both would end up with weight 1, when "turnover" might just be an unusual word that happened to show up once in sports, while "soccer" clearly has a strong connection to sports if it appears many times.
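The flaw can be shown numerically; the function and the counts below are purely illustrative, not taken from our data:

```python
def ratio_weight(count_in_category, count_in_all_categories):
    """Frequency of a word in one category divided by its frequency
    across all categories combined."""
    return count_in_category / count_in_all_categories

# Both words are exclusive to sports, so both get the maximum weight 1.0,
# even though "soccer" is far stronger evidence than a one-off "turnover".
print(ratio_weight(500, 500))  # soccer:   1.0
print(ratio_weight(1, 1))      # turnover: 1.0
```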
What is new?
(1) We achieved better feature selection by changing the formula of TF-IDF. Rather than the traditional TF-IDF formula, we use TF/DF as the weight of each word for weighting and feature selection. Because we already did some preprocessing work while calculating TF, we keep every word as a feature at this step.
We do not use the traditional TF-IDF formula because we find that many words occur in all kinds of news with relatively high term frequency; such words should not be considered as features, or should be given a very low weight.
Using TF/DF as the weight of each word ensures that words occurring in all categories get a low weight and do not hurt the accuracy of our classifier, while words that occur in only one category get a high weight.
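The report does not pin down exactly how TF and DF are normalized, so the sketch below is one plausible reading with made-up counts: TF as the word's count within a category, and DF as the number of categories containing the word, which gives exactly the behavior described above (category-exclusive words score high, ubiquitous words score low):

```python
def tf_df_weight(word, category, category_counts):
    """Weight of a word for a category: term frequency within the category
    (TF) divided by the number of categories containing the word (DF)."""
    tf = category_counts[category].get(word, 0)
    df = sum(1 for counts in category_counts.values() if word in counts)
    return tf / df if df else 0.0

# Hypothetical counts: "soccer" is exclusive to sports, "year" is everywhere.
category_counts = {
    "sports":   {"soccer": 120, "year": 40},
    "politics": {"year": 35},
    "comp":     {"year": 50},
}
print(tf_df_weight("soccer", "sports", category_counts))  # 120.0
print(tf_df_weight("year", "sports", category_counts))    # ≈ 13.3
```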
(2) We improved algorithm efficiency and accuracy of the classifier by
Results
We use 600 test documents from every category for testing, and below is the confusion matrix of the testing results.
Confusion matrix (rows: actual category; columns: predicted category)

            Politics   Comp   Auto   Religion   Sports   Others
Politics       577        1      2         19        1        0
Comp            58      515      7          8        6        0
Auto            77       21    473         20        9        0
Religion        99       12      0        488        1        0
Sports          54       12      2         15      516        1
From the confusion matrix of the result, we can get the recall, precision, and accuracy of our classifier.
            Recall    Precision
Politics    96.17%    66.71%
Comp        85.83%    91.80%
Auto        78.83%    97.72%
Religion    81.33%    88.73%
Sports      86.00%    96.81%

Overall accuracy: 85.63%
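As a check, these figures can be recomputed directly from the confusion matrix above (following the report, 600 test documents per category is used as the recall and accuracy denominator):

```python
# Confusion matrix from the report: rows = actual category, columns = predicted.
labels = ["Politics", "Comp", "Auto", "Religion", "Sports", "Others"]
matrix = {
    "Politics": [577, 1, 2, 19, 1, 0],
    "Comp":     [58, 515, 7, 8, 6, 0],
    "Auto":     [77, 21, 473, 20, 9, 0],
    "Religion": [99, 12, 0, 488, 1, 0],
    "Sports":   [54, 12, 2, 15, 516, 1],
}
N_PER_CATEGORY = 600  # test documents per category

# Accuracy: correctly classified documents over all test documents.
correct = sum(matrix[c][labels.index(c)] for c in matrix)
accuracy = correct / (N_PER_CATEGORY * len(matrix))
print(f"accuracy = {accuracy:.2%}")  # → accuracy = 85.63%

for c in matrix:
    i = labels.index(c)
    recall = matrix[c][i] / N_PER_CATEGORY          # diagonal over row total
    precision = matrix[c][i] / sum(row[i] for row in matrix.values())  # over column total
    print(f"{c}: recall {recall:.2%}, precision {precision:.2%}")
```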
From the result, we can see that the classifier has different recall and precision on different categories. Politics has a relatively high recall but low precision: news about politics has distinctive features, but some news in other categories is also related to politics.
The overall accuracy of our classifier is 85.63%, which is an acceptable result.