Algorithmic Filtering. Computational Journalism Week 4

Frontiers of Computational Journalism

Columbia Journalism School
Week 4: Algorithmic Filtering

September 26, 2014

Journalism as a cycle
[Diagram: cycle linking Data, Reporting, Filtering, User, and Effects, with CS marked at several stages]

[Diagram: stories not covered (marked x), filtering, User]
Each day, the Associated Press publishes:

~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

more video on YouTube than produced
by TV networks during entire 20th century

Google now indexes more web pages
than there are people in the world

400,000,000 tweets per day

estimated 130,000,000 books ever published

10,000 legally-required reports filed by
U.S. public companies every day
All New York Times articles ever = 0.06 terabytes
(13 million stories, 5k per story)
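
As a quick check on the arithmetic above, here is a minimal sketch. It assumes "5k" means roughly 5 kilobytes of text per story and uses decimal terabytes; both are assumptions about units, not something the slide states.

```python
# Back-of-the-envelope size of the archive, using the slide's figures.
STORIES = 13_000_000        # 13 million stories
BYTES_PER_STORY = 5_000     # "5k per story", read here as ~5 kilobytes (assumption)

total_bytes = STORIES * BYTES_PER_STORY
total_terabytes = total_bytes / 1e12   # decimal terabytes (10**12 bytes)

print(f"{total_terabytes:.3f} TB")
```

Under those assumptions the product is 65 gigabytes, in line with the 0.06 terabytes on the slide.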
"It's not information overload, it's filter failure"

- Clay Shirky
System Description
Scrape -> Cluster events -> Cluster topics -> Summarize
Scrape
Handcrafted list of source URLs (news front
pages) and links followed to depth 4
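
A depth-limited crawl like the one described can be sketched as a breadth-first search. This is only an illustration: `crawl`, `get_links`, and the toy link graph are hypothetical names, and a real scraper would fetch and parse pages inside `get_links`.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl from seed_urls, following links up to max_depth hops.

    get_links(url) is a caller-supplied function returning the URLs linked
    from a page; in a real scraper it would fetch and parse the page.
    """
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy link graph standing in for real pages (hypothetical identifiers).
LINKS = {
    "front": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}
pages = crawl(["front"], lambda url: LINKS.get(url, []), max_depth=4)
```

In the toy graph, page "f" sits five hops from the front page, so it is never visited.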

Then extract the text of each article

Text extraction from HTML

Ideal world: HTML5 article tags

"The article element represents a component of a page that
consists of a self-contained composition in a document, page,
application, or site and that is intended to be independently
distributable or reusable, e.g. in syndication."
- W3C Specification
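
In that ideal world, extraction could be as simple as collecting the text inside `<article>` elements. The sketch below uses Python's standard `html.parser`; the class and function names are illustrative, and it ignores complications such as scripts embedded inside the article.

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect text that appears inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <article> elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_article_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><body><nav>Home | Sports</nav>"
    "<article><h1>Headline</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
text = extract_article_text(page)
```

Navigation and footer text fall outside the article element and are dropped.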
The dismal reality of text extraction
Every site is a beautiful flower.


Newsblaster paper:

"For each page examined, if the amount of text in the largest cell
of the page (after stripping tags and links) is greater than some
particular constant (currently 512 characters), it is assumed to
be a news article, and this text is extracted."

(At least it's simple. This was 2002. How often does this work
now?)
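
The heuristic quoted above can be sketched roughly as follows. This is not the Newsblaster code: it assumes "cell" means an HTML table cell (`<td>`), does not handle nested tables, and all names are illustrative.

```python
from html.parser import HTMLParser

MIN_CHARS = 512  # threshold quoted from the Newsblaster paper

class CellText(HTMLParser):
    """Gather the text of each table cell, skipping text inside links."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell texts
        self.current = None  # chunks of the cell being read, or None
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current = []
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.current is not None:
            self.cells.append(" ".join(self.current))
            self.current = None
        elif tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        if self.current is not None and not self.in_link and data.strip():
            self.current.append(data.strip())

def extract_if_article(html, min_chars=MIN_CHARS):
    """Return the largest cell's text if it exceeds min_chars, else None."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None

long_page = ("<table><tr><td><a href='/'>Home</a> Menu</td><td>"
             + "word " * 200 + "</td></tr></table>")
short_page = "<table><tr><td>short</td></tr></table>"
article = extract_if_article(long_page)
nothing = extract_if_article(short_page)
```

A page laid out without tables yields no cells at all, which is one reason a rule like this may work less often on modern sites.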


Now multiple services/APIs to do this, e.g. readability.com

Cluster Events
Surprise!

encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
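
The first two steps can be illustrated with a small pure-Python sketch. The TF-IDF weighting used here (term frequency times log of inverse document frequency) is one common variant and is an assumption; the slides do not say which formula the system uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

docs = [
    "senate passes budget bill".split(),
    "senate debates budget bill".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
near = cosine_distance(vecs[0], vecs[1])   # two stories on the same event
far = cosine_distance(vecs[0], vecs[2])    # unrelated stories
```

The two budget stories share most of their terms and end up closer to each other than either is to the sports story.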
Different clustering algorithms

Partitioning
    keep adjusting clusters until convergence
    e.g. K-means
Agglomerative hierarchical
    start with leaves, repeatedly merge clusters
    e.g. MIN and MAX approaches
Divisive hierarchical
    start with root, repeatedly split clusters
    e.g. binary split
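
To make the MIN and MAX approaches concrete, here is a minimal sketch of agglomerative clustering on one-dimensional points. Stopping when the closest pair of clusters is farther apart than a threshold is an assumption made for the example; the function and parameter names are illustrative.

```python
def agglomerative(points, distance, threshold, linkage=min):
    """Merge the two closest clusters until none are closer than threshold.

    linkage=min gives single-link (MIN); linkage=max gives complete-link (MAX).
    """
    clusters = [[p] for p in points]   # start with leaves
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(distance(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break   # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [0.0, 1.0, 2.0, 3.0, 10.0]
dist = lambda a, b: abs(a - b)
single = agglomerative(points, dist, threshold=1.5, linkage=min)
complete = agglomerative(points, dist, threshold=1.5, linkage=max)
```

With the same threshold, MIN chains the evenly spaced points into one long cluster, while MAX keeps clusters compact and stops at pairs.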
But news is an on-line problem...

Articles arrive one at a time, and must be
clustered immediately.

Can't look forward in time, can't go back and
reassign.

Greedy algorithm.
Single pass clustering

put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
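
The pseudocode above translates fairly directly into Python. Two details are assumptions, since the slide leaves them open: distance to a cluster is taken as the distance to its closest member, and when several clusters are within T the closest one wins.

```python
def single_pass_cluster(stories, distance, threshold):
    """Greedy online clustering: each story is assigned once, on arrival."""
    clusters = []
    for story in stories:
        best_cluster, best_dist = None, threshold
        for cluster in clusters:
            # Distance to a cluster = distance to its closest member
            # (an assumption; the slide does not say how this is measured).
            d = min(distance(story, member) for member in cluster)
            if d < best_dist:
                best_cluster, best_dist = cluster, d
        if best_cluster is not None:
            best_cluster.append(story)      # put S in C
        else:
            clusters.append([story])        # put S in new cluster
    return clusters

# Numbers stand in for stories; absolute difference stands in for cosine distance.
stream = [1.0, 1.2, 5.0, 5.1, 1.1, 9.0]
clusters = single_pass_cluster(stream, lambda a, b: abs(a - b), threshold=0.5)
```

Because assignments are never revisited, the result can depend on the order in which stories arrive.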
Evaluating clusterings
When is one clustering better than another?

Ideally, we'd like a quantitative metric.

This is possible if we have training data = human
generated clusters.

Available from the TDT2 corpus (topic detection
and tracking)
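
One simple way to score a clustering against human-generated clusters is to count the document pairs on which the two disagree. This pairwise measure is an assumption chosen for illustration; the slides do not say which error metric was used.

```python
from itertools import combinations

def pairwise_error(predicted, reference):
    """Fraction of document pairs where the two clusterings disagree.

    predicted and reference map each document id to a cluster label.
    """
    docs = sorted(reference)
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (predicted[a] == predicted[b]) != (reference[a] == reference[b])
        for a, b in pairs
    )
    return disagreements / len(pairs)

# Hypothetical labels: human judgments versus a system's output.
human = {"d1": "quake", "d2": "quake", "d3": "election", "d4": "election"}
system = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
error = pairwise_error(system, human)
```

Here the system wrongly groups d3 with the quake stories, so three of the six pairs disagree with the human clusters.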

Error with respect to hand-generated clusters from training data
Features: all words + entities
Entity extraction
Names, dates, places.

Services like OpenCalais. Best algorithms use dictionaries +
probabilistic parsers.
All words does best!
But maybe possible to combine features to do better?
How to combine different features?

In Newsblaster case, for every pair of
documents (d_i, d_j) we have three similarity values

dist1(i,j) = TF-IDF on all words
dist2(i,j) = nouns extracted by LinkIt algorithm
dist3(i,j) = entities extracted by Nominator algorithm

But we need a single distance function dist(d_i, d_j) that
takes into account all information.
Answer: weighted sum of distance fns

$$\mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j)$$

w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominator entities
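
As a small worked example of the weighted sum, the sketch below combines three per-feature distances for a single document pair. The distance values and weights are made up for illustration and are not fitted values from the paper.

```python
def combined_distance(distances, weights):
    """Weighted sum of per-feature distances for one document pair."""
    if len(distances) != len(weights):
        raise ValueError("need one weight per distance")
    return sum(w * d for w, d in zip(weights, distances))

# Hypothetical values for one pair (i, j):
# TF-IDF on all words, LinkIt nouns, Nominator entities.
pair_distances = [0.5, 0.25, 1.0]
weights = [0.5, 0.25, 0.25]    # illustrative weights, not fitted values
d = combined_distance(pair_distances, weights)
```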

How to find optimal w_i?

Well, we have training data available.
So

Regression fit
From training data, define perfect distance fn:

R(i,j) = 0 if i,j in same cluster
       = 1 otherwise

Find w_i that minimize

$$\sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2$$

*actually, we use logistic regression, because it's a better fit to
binary variables like R(i,j)
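
A minimal sketch of the logistic-regression variant, fitted by plain gradient descent on toy data. The bias term, learning rate, step count, and training pairs are all assumptions made for the example, not details from the Newsblaster paper.

```python
import math

def fit_logistic(features, targets, steps=2000, rate=0.5):
    """Fit weights w and bias b so sigmoid(w . x + b) approximates targets."""
    n_features = len(features[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, r in zip(features, targets):
            z = sum(wk * xk for wk, xk in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for k in range(n_features):
                grad_w[k] += (p - r) * x[k]
            grad_b += p - r
        # Average the gradient over the training pairs and take one step.
        w = [wk - rate * g / len(features) for wk, g in zip(w, grad_w)]
        b -= rate * grad_b / len(features)
    return w, b

def predict(w, b, x):
    """Estimated probability that a pair belongs to different clusters."""
    z = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy training pairs: (dist1, dist2, dist3), with R = 0 if same cluster, else 1.
X = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.8, 0.9, 0.7), (0.9, 0.7, 0.8)]
R = [0, 0, 1, 1]
w, b = fit_logistic(X, R)
```

The fitted model outputs a value between 0 and 1 for each pair, which can then play the role of the combined distance.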
Combination slightly better
Also, error less sensitive to clustering threshold T
Now sort events into categories

Categories:
U.S., World, Finance, Science and Technology,
Entertainment, Sports.

Primitive operation: what topic is this story in?
TF-IDF, again
Each category has pre-assigned TF-IDF
coordinate. Story category = closest point.
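
Assigning a story to its closest category point might look like the sketch below. Cosine similarity as the closeness measure and all of the term weights are assumptions for illustration; real category coordinates would be TF-IDF vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def categorize(story_vec, category_vecs):
    """Return the category whose vector is closest to the story's vector."""
    return max(category_vecs,
               key=lambda name: cosine_similarity(story_vec, category_vecs[name]))

# Hypothetical term weights for three of the categories.
CATEGORIES = {
    "World":   {"treaty": 0.9, "minister": 0.8, "border": 0.6},
    "Finance": {"stocks": 0.9, "earnings": 0.8, "bank": 0.7},
    "Sports":  {"match": 0.9, "coach": 0.7, "season": 0.6},
}
story = {"bank": 0.5, "earnings": 0.7, "minister": 0.1}
category = categorize(story, CATEGORIES)
```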
[Figure labels: "world" category, latest story, "finance" category]
Cluster summarization
Problem: given a set of documents, write a
sentence summarizing them.

Difficult problem. See references in Newsblaster
paper, and more recent techniques.
Is Newsblaster really a filter?

After all, it shows all the news...

Differences with Google News?

Personalization!
Not every person needs to see the same news.
Filter design problem

Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)

Define
r(S,U,{P},{B}) in [0...1]

relevance of story S to user U
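
To make the shape of r concrete, here is a deliberately naive sketch. The scoring rule (share of story terms that match a user's stated interests) and every name in it are hypothetical; it only shows a function with this signature returning a value in [0, 1].

```python
from dataclasses import dataclass, field

@dataclass
class User:
    # Stand-in for preferences, history, characteristics (U).
    interests: set = field(default_factory=set)

def relevance(story_terms, user, previous_scores=(), background=None):
    """Toy r(S, U, {P}, {B}): share of story terms matching user interests.

    previous_scores ({P}) and background ({B}) are accepted to mirror the
    slide's signature but are ignored in this sketch.
    """
    if not story_terms:
        return 0.0
    matches = sum(1 for term in story_terms if term in user.interests)
    return matches / len(story_terms)

reader = User(interests={"senate", "vote"})
score = relevance(["budget", "senate", "weather", "vote"], reader)
```

A filter this simple only ever shows people more of what they already said they wanted.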
What makes a filtering algorithm "good"?
Editors
Editors are filters. They decide what stories to run, and how
prominent to make them.

How do they choose?

The Echo Chamber

"[Echo chambers are] those Internet spaces where like-minded
people listen only to those people who already agree with them.
...
While most of us had assumed that the Internet would increase
the diversity of opinion, the echo chamber meme says the Net
encourages groups to form that increase the homogeneity of
belief. This isn't simply a factual argument about the topography
carved by traffic and links. A 'tut, tut' has been appended: See,
you Web idealists have been shown up - humankind's social
nature sucks, just as we always told you!"

- David Weinberger, Is there an echo in here?

Graph of political book sales during 2008 U.S. election, by orgnet.org
From Amazon "users who bought X also bought Y" data.

Retweet network of political tweets.
From Conover et al., Political Polarization on Twitter

Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli
(Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks
The Filter Bubble

"What people care about politically, and what they're motivated
to do something about, is a function of what they know about
and what they see in their media. ... People see something about
the deficit on the news, and they say, 'Oh, the deficit is the big
problem.' If they see something about the environment, they
say the environment is a big problem.

This creates this kind of a feedback loop in which your media
influences your preferences and your choices; your choices
influence your media; and you really can go down a long and
narrow path, rather than actually seeing the whole set of issues
in front of us."

- Eli Pariser,
How do we recreate a front-page ethos for a digital world?
The (Algorithmic) Filter Bubble

If we try to present stories that the user will want to
click on... do we end up only telling people what they
want to hear?

If an algorithm only shows us things our friends like,
will we ever see anything that challenges us?
Filter design problem, restated

When should a user see a story?

Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI
    how do I tell the computer what I want?
technical
    constrained by algorithmic possibility
economic
    cheap enough to deploy widely
Information diet
"The holy grail in this model, as far as I'm
concerned, would be a Firefox plugin that
would passively watch your websurfing
behavior and characterize your personal
information consumption. Over the course of a
week, it might let you know that you hadn't
encountered any news about Latin America, or
remind you that a full 40% of the pages you
read had to do with Sarah Palin. It wouldn't
necessarily prescribe changes in your behavior,
simply help you monitor your own
consumption in the hopes that you might make
changes."

- Ethan Zuckerman,
Playing the Internet with PMOG
