of Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
September 26, 2014
Journalism as a cycle
[Diagram: Data, Reporting, Filtering, User, Effects; CS labels mark several stages]
[Diagram: filtering, with User labels]
Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
Estimated 130,000,000 books ever published
- Clay Shirky
System Description
Scrape -> Cluster events -> Cluster topics -> Summarize
Scrape
Handcrafted list of source URLs (news front pages) and links followed to depth 4.
Then extract the text of each article.
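The scrape step can be pictured as a depth-limited breadth-first crawl. The following is only a minimal sketch of that idea, not the system's actual crawler: the `crawl` function, the `get_links` callback, and the `TOY_WEB` link graph are all hypothetical names, and the in-memory graph stands in for real HTTP fetches and HTML parsing.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl: return every URL within max_depth links of a seed."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph standing in for real front pages and articles.
TOY_WEB = {
    "front": ["a1", "a2"],
    "a1": ["a2", "b1"],
    "b1": ["c1"],
    "c1": ["d1"],
    "d1": ["e1"],
}

pages = crawl(["front"], lambda url: TOY_WEB.get(url, []), max_depth=4)
```

Here `e1` sits five links from the seed, so it is never visited. In a real scraper, `get_links` would fetch the page and extract its anchors, and a separate step would pull the article text out of each page.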
Cluster Events
Surprise!
Encode articles into feature vectors
Cosine distance function
Hierarchical clustering algorithm
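To make the first two ingredients concrete, here is a small sketch that encodes tokenized articles as sparse vectors and compares them with cosine distance. It assumes TF-IDF weighting (the slide says only "feature vectors"), and the function names and sample headlines are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 1 for no shared terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical headlines, already tokenized.
docs = [
    "senate passes budget bill".split(),
    "senate votes on budget".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
d_close = cosine_distance(vecs[0], vecs[1])  # two stories on the same event
d_far = cosine_distance(vecs[0], vecs[2])    # unrelated stories
```

The two budget stories share weighted terms and so end up closer than the budget story and the sports story, which share none.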
Agglomerative hierarchical: start with leaves, repeatedly merge clusters, e.g. MIN and MAX approaches
Divisive hierarchical: start with root, repeatedly split clusters, e.g. binary split
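A naive sketch of the agglomerative variant may help. It assumes that MIN and MAX refer to single-link and complete-link merging (the distance between two clusters is the smallest or largest pairwise distance between their members). The function name, the stopping rule (`num_clusters`), and the toy points are illustrative choices, and the loop favors clarity over speed.

```python
def agglomerative(points, dist, num_clusters, linkage=min):
    """Merge the two closest clusters until num_clusters remain.

    linkage=min gives the MIN (single-link) approach,
    linkage=max gives the MAX (complete-link) approach.
    """
    clusters = [[i] for i in range(len(points))]  # start with leaves
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage.
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Toy one-dimensional points; real input would be article vectors.
points = [0, 1, 2, 10, 11]
result = agglomerative(points, lambda x, y: abs(x - y), num_clusters=2)
result_max = agglomerative(points, lambda x, y: abs(x - y),
                           num_clusters=2, linkage=max)
```

Each cluster is a list of indices into `points`. On this well-separated data both linkages recover the same two groups; they tend to differ on elongated or noisy clusters.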
Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (topic detection and tracking).
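The slides do not say which metric is used, so the following is only one possible choice: a pairwise F1 score that asks, for every pair of articles, whether the algorithm and the human annotators agree on putting them in the same cluster. The function names and labels are hypothetical.

```python
from itertools import combinations

def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def pairwise_f1(predicted, reference):
    """F1 over same-cluster pairs; 1.0 means identical groupings."""
    pred = same_cluster_pairs(predicted)
    ref = same_cluster_pairs(reference)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical human-generated clusters for five articles.
human = ["quake", "quake", "quake", "election", "election"]
good = [0, 0, 0, 1, 1]  # same grouping, different label names
bad = [0, 0, 1, 1, 1]   # third article placed in the wrong cluster

score_good = pairwise_f1(good, human)
score_bad = pairwise_f1(bad, human)
```

Because the score compares pairs rather than label names, a clustering only needs to reproduce the human grouping, not the human labels.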
Entity extraction
But we need a single distance function dist(d_i, d_j) that takes into account all information.
$$\mathrm{dist}(i, j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i, j)$$
Regression fit
From training data, define perfect distance fn:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find the weights w_k that minimize
$$\sum_{i,j} \left( \mathrm{dist}(i, j) - R(i, j) \right)^2$$
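The fit can be sketched as ordinary least squares over article pairs. This is an illustration under assumptions, not the system's implementation: the three component distances per pair are made-up numbers, there is no intercept term, and the helper names (`solve`, `fit_weights`, `squared_error`) are hypothetical. It solves the normal equations directly in plain Python.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_weights(rows, targets):
    """Weights w_k minimizing sum over pairs of (sum_k w_k * dist_k - R)^2."""
    k = len(rows[0])
    ata = [[sum(row[p] * row[q] for row in rows) for q in range(k)]
           for p in range(k)]
    atb = [sum(row[p] * r for row, r in zip(rows, targets)) for p in range(k)]
    return solve(ata, atb)

def squared_error(weights, rows, targets):
    """The objective: summed squared gap between combined distance and R."""
    return sum((sum(w * d for w, d in zip(weights, row)) - r) ** 2
               for row, r in zip(rows, targets))

# Hypothetical training pairs: (dist_1, dist_2, dist_3) for each article pair,
# with R = 0 if humans put the pair in the same cluster and 1 otherwise.
rows = [
    [0.1, 0.2, 0.3],
    [0.2, 0.1, 0.4],
    [0.9, 0.8, 0.5],
    [0.8, 0.9, 0.6],
    [0.3, 0.7, 0.2],
    [0.7, 0.4, 0.9],
]
targets = [0, 0, 1, 1, 0, 1]
weights = fit_weights(rows, targets)
```

By construction the fitted weights give a squared error no larger than any other fixed weighting, including equal weights on the three components.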
Primitive operation: what topic is this story in?
TF-IDF, again
Each category has pre-assigned TF-IDF coordinate. Story category = closest point.
[Diagram: points labeled "world category", "latest story", "finance category"]
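The "closest point" rule can be sketched as a nearest-neighbor lookup. The example assumes cosine distance as the notion of closeness (the slide does not say), and the category coordinates and story vector are invented values, not real TF-IDF output.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical pre-assigned TF-IDF coordinates, one per category.
CATEGORIES = {
    "world": {"election": 0.7, "treaty": 0.6, "border": 0.4},
    "finance": {"stocks": 0.8, "bank": 0.5, "rates": 0.3},
}

def categorize(story_vec, categories=CATEGORIES):
    """Return the name of the category whose coordinate is closest to the story."""
    return min(categories,
               key=lambda name: cosine_distance(story_vec, categories[name]))

# Hypothetical TF-IDF vector for the latest story.
latest_story = {"bank": 0.6, "rates": 0.5, "election": 0.1}
topic = categorize(latest_story)
```

The story mentions an election in passing, but its weight sits mostly on banking terms, so the finance coordinate is the closer of the two.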
Cluster summarization
Problem: given a set of documents, write a sentence summarizing them.
Difficult problem. See references in Newsblaster paper, and more recent techniques.
Personalization!
Not every person needs to see the same news.
Editors
Editors are filters. They decide what stories to run, and how prominent to make them.
How do they choose?
- Eli Pariser, "How do we recreate a front-page ethos for a digital world?"
Information diet
"The holy grail in this model, as far as I'm concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn't encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn't necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes."
- Ethan Zuckerman, "Playing the Internet with PMOG"