
Frontiers of Computational Journalism

Columbia Journalism School
Week 4: Algorithmic Filtering

September 26, 2014

Journalism as a cycle
[Figure: cycle diagram linking Data, Reporting, Filtering, User, and Effects, with "CS" marked at several stages. A second panel marks stories not covered (x) and the filtering step before the User.]

Each day, the Associated Press publishes:



~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than produced
by TV networks during the entire 20th century


Google now indexes more web pages
than there are people in the world

400,000,000 tweets per day

estimated 130,000,000 books ever published

10,000 legally-required reports filed by
U.S. public companies every day

All New York Times articles ever = 0.06 terabytes

(13 million stories, 5k per story)
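The arithmetic behind that figure can be checked in a few lines. This is a rough sketch that assumes "5k" means 5,000 bytes and that a terabyte is 10**12 bytes; neither convention is stated on the slide.

```python
# Back-of-the-envelope size of a text archive.
# Assumptions: "5k" = 5,000 bytes per story, 1 terabyte = 10**12 bytes.
STORIES = 13_000_000
BYTES_PER_STORY = 5_000

total_bytes = STORIES * BYTES_PER_STORY
total_terabytes = total_bytes / 10**12

print(f"{total_bytes / 10**9:.0f} GB = {total_terabytes:.3f} TB")
```

Under these assumptions the archive comes to about 65 GB, in line with the 0.06 terabytes quoted above.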

It's not information overload, it's filter failure


- Clay Shirky

System Description

Scrape -> Cluster events -> Cluster topics -> Summarize

Scrape
Handcrafted list of source URLs (news front
pages) and links followed to depth 4

Then extract the text of each article
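One way to picture "links followed to depth 4" is a breadth-first crawl with a depth limit. The sketch below is an illustration, not the actual Newsblaster crawler: `crawl`, `get_links`, and the toy link graph are hypothetical names, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl: return every URL reachable from the seeds
    by following at most max_depth links."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy link graph standing in for real pages (hypothetical URLs).
TOY_WEB = {
    "front": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}
found = crawl(["front"], lambda url: TOY_WEB.get(url, []))
```

Here page "f" sits five links from the front page, so it is not collected.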

Text extraction from HTML

Ideal world: HTML5 article tags

The article element represents a component of a page that
consists of a self-contained composition in a document, page,
application, or site and that is intended to be independently
distributable or reusable, e.g. in syndication.
- W3C Specification

The dismal reality of text extraction

Every site is a beautiful flower.



Text extraction from HTML

Newsblaster paper:

For each page examined, if the amount of text in the largest cell
of the page (after stripping tags and links) is greater than some
particular constant (currently 512 characters), it is assumed to
be a news article, and this text is extracted.

(At least it's simple. This was 2002. How often does this work
now?)
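The heuristic quoted above can be sketched with Python's standard `html.parser`. This is a guess at the details, not the paper's code: it assumes "cell" means a `<td>` element, that "stripping links" means dropping text inside `<a>` elements, and the names `LargestCellParser` and `extract_article` are made up for the example.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # threshold quoted from the Newsblaster paper

class LargestCellParser(HTMLParser):
    """Collect the text of each <td> cell, ignoring text inside links."""

    def __init__(self):
        super().__init__()
        self.open_cells = []   # stack of text buffers, one per open <td>
        self.cells = []        # finished cell texts
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            raw = "".join(self.open_cells.pop())
            self.cells.append(" ".join(raw.split()))  # normalize whitespace
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.open_cells and not self.link_depth:
            self.open_cells[-1].append(data)  # credit the innermost cell only

def extract_article(html, min_chars=MIN_ARTICLE_CHARS):
    """Return the largest cell's text if it looks like an article, else None."""
    parser = LargestCellParser()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

A page laid out without tables yields no cells at all, so this sketch returns None for it, which hints at why the rule ages badly.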



Text extraction from HTML

Now multiple services/APIs to do this, e.g. readability.com







Cluster Events
Surprise!

encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
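The first two ingredients can be sketched in plain Python. The TF-IDF variant below (raw term count times log of inverse document frequency) is one common choice and only an assumption here; the slide does not say which weighting is used. Function names and sample headlines are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} TF-IDF vector."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: c * math.log(n / doc_freq[t]) for t, c in counts.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for no shared terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an all-zero vector is treated as unrelated to everything
    return 1.0 - dot / (norm_u * norm_v)

docs = [
    "senate passes budget bill".split(),
    "senate budget vote delayed".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
```

With these toy headlines, the two budget stories end up closer to each other than either is to the sports story.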

Different clustering algorithms

Partitioning
  keep adjusting clusters until convergence
  e.g. K-means

Agglomerative hierarchical
  start with leaves, repeatedly merge clusters
  e.g. MIN and MAX approaches

Divisive hierarchical
  start with root, repeatedly split clusters
  e.g. binary split
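To make the agglomerative case concrete, here is a minimal sketch on one-dimensional points. It assumes MIN means single-link (closest pair of members) and MAX means complete-link (farthest pair), and it stops when k clusters remain; the function name and data are illustrative.

```python
def agglomerative(points, k, linkage=min):
    """Merge the two closest clusters until only k remain.

    linkage=min gives the MIN (single-link) rule and linkage=max the
    MAX (complete-link) rule for the distance between two clusters.
    """
    clusters = [[p] for p in points]  # start with one leaf per point
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    return clusters

points = [0, 2, 4, 6, 8.5, 9.5]
by_min = agglomerative(points, 2, linkage=min)  # [[0, 2, 4, 6], [8.5, 9.5]]
by_max = agglomerative(points, 2, linkage=max)  # [[0, 2], [4, 6, 8.5, 9.5]]
```

The same points split differently under the two rules: MIN chains evenly spaced neighbors together, while MAX prefers compact groups.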

But news is an on-line problem...


Articles arrive one at a time, and must be
clustered immediately.

Can't look forward in time, can't go back and
reassign.

Greedy algorithm.

Single pass clustering



put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
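The pseudocode above translates almost line for line into Python. Two details are assumptions, since the slide leaves them open: the distance from a story to a cluster is taken as the distance to its closest member, and when several clusters fall under the threshold the nearest one wins. The word-overlap distance is a toy stand-in for a real document distance.

```python
def single_pass_cluster(stories, distance, threshold):
    """Greedy online clustering: each story is assigned once, on arrival."""
    clusters = []
    for story in stories:
        best, best_dist = None, threshold
        for cluster in clusters:
            # Assumption: distance to a cluster = distance to its closest member.
            d = min(distance(story, member) for member in cluster)
            if d < best_dist:
                best, best_dist = cluster, d
        if best is not None:
            best.append(story)
        else:
            clusters.append([story])  # nothing close enough: start a new cluster
    return clusters

def word_distance(a, b):
    """Toy distance: 1 - Jaccard overlap of the two stories' word sets."""
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / len(wa | wb)

stories = [
    "senate passes budget bill",
    "storm hits gulf coast",
    "senate budget bill delayed",
    "gulf coast storm damage",
]
clusters = single_pass_cluster(stories, word_distance, threshold=0.6)
```

Because assignments are never revisited, the result can depend on arrival order and on the threshold T.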

Evaluating clusterings
When is one clustering better than another?

Ideally, we'd like a quantitative metric.

This is possible if we have training data = human-generated
clusters.

Available from the TDT2 corpus (topic detection
and tracking)

Error with respect to hand-generated clusters from training data
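The slide does not say which error measure is plotted. As one plausible illustration, the sketch below counts the fraction of story pairs on which a clustering disagrees with the hand-generated one; this pair-counting metric, the function name, and the labels are all assumptions.

```python
from itertools import combinations

def pairwise_error(predicted, reference):
    """Fraction of item pairs on which two clusterings disagree.

    Both arguments map item -> cluster label. A pair counts as an error
    when one clustering puts the two items together and the other does not.
    """
    items = sorted(reference)
    pairs = list(combinations(items, 2))
    errors = sum(
        (predicted[a] == predicted[b]) != (reference[a] == reference[b])
        for a, b in pairs
    )
    return errors / len(pairs)

reference = {"s1": "budget", "s2": "budget", "s3": "storm", "s4": "storm"}
predicted = {"s1": 0, "s2": 0, "s3": 0, "s4": 1}
error = pairwise_error(predicted, reference)  # 3 of 6 pairs disagree
```

A measure like this only compares groupings, so the cluster labels themselves do not need to match.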

Features: all words + entities

Entity extraction

Names, dates, places.


Services like OpenCalais. Best algorithms use dictionaries +
probabilistic parsers.

All words does best!

But maybe possible to combine features to do better?

How to combine different features?


In the Newsblaster case, for every pair of
documents (d_i, d_j) we have three similarity values

dist_1(i,j) = TF-IDF on all words
dist_2(i,j) = nouns extracted by LinkIt algorithm
dist_3(i,j) = entities extracted by Nominator algorithm

But we need a single distance function dist(d_i, d_j) that
takes into account all information.

Answer: weighted sum of distance fns

$$\mathrm{dist}(i, j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i, j)$$

w_1 = weight of TF-IDF distance
w_2 = weight of distance on LinkIt nouns
w_3 = weight of distance on Nominator entities
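In code, the weighted sum is a one-liner. The distances and weights below are made-up numbers for a single document pair, chosen only to show the mechanics.

```python
def combined_distance(distances, weights):
    """Weighted sum of per-feature distances for one document pair."""
    if len(distances) != len(weights):
        raise ValueError("need one weight per distance")
    return sum(w * d for w, d in zip(weights, distances))

# Hypothetical values for one pair (i, j): TF-IDF, LinkIt nouns, Nominator entities.
pair_distances = [0.8, 0.5, 0.25]
weights = [0.5, 0.25, 0.25]  # illustrative only, not fitted
combined = combined_distance(pair_distances, weights)
```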

How to find optimal w_k?


Well, we have training data available.
So

Regression fit
From training data, define perfect distance fn:

R(i,j) = 0 if i,j in same cluster
       = 1 otherwise

Find w_k that minimize

$$\sum_{i,j} \left( \mathrm{dist}(i, j) - R(i, j) \right)^2$$

*actually, we use logistic regression, because it's a better fit to
binary variables like R(i,j)
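The plain least-squares version of this fit can be sketched without any libraries by solving the normal equations. This illustrates the squared-error objective only, not the logistic regression the footnote mentions; the function names and training numbers are invented, and the solver assumes the feature columns are linearly independent.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_weights(features, targets):
    """Least-squares weights w minimizing sum((w . d - R)^2) over pairs."""
    k = len(features[0])
    # Normal equations: (D^T D) w = D^T R
    dtd = [[sum(row[p] * row[q] for row in features) for q in range(k)]
           for p in range(k)]
    dtr = [sum(row[p] * r for row, r in zip(features, targets)) for p in range(k)]
    return solve_linear_system(dtd, dtr)

# Each row: (dist_1, dist_2, dist_3) for one document pair (hypothetical data).
pair_features = [
    [0.1, 0.2, 0.1],
    [0.9, 0.8, 1.0],
    [0.2, 0.1, 0.3],
    [0.8, 1.0, 0.7],
    [0.3, 0.4, 0.2],
]
same_cluster_r = [0, 1, 0, 1, 0]  # R(i,j): 0 = same cluster, 1 = different
weights = fit_weights(pair_features, same_cluster_r)
```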

Combination slightly better

Also, error less sensitive to clustering threshold T

Now sort events into categories


Categories:
U.S., World, Finance, Science and Technology,
Entertainment, Sports.


Primitive operation: what topic is this story in?

TF-IDF, again
Each category has pre-assigned TF-IDF
coordinate. Story category = closest point.
[Figure: the latest story plotted between the World category point and the Finance category point.]
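"Story category = closest point" amounts to a nearest-reference-vector lookup. In the sketch below the category vectors are invented for illustration and closeness is measured by cosine similarity; the slide specifies neither.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0

def categorize(story_vector, category_vectors):
    """Return the category whose reference vector is closest to the story."""
    return max(
        category_vectors,
        key=lambda name: cosine_similarity(story_vector, category_vectors[name]),
    )

# Hypothetical reference points; a real system would derive them from labeled text.
CATEGORIES = {
    "World": {"minister": 1.0, "treaty": 0.8, "border": 0.6},
    "Finance": {"stocks": 1.0, "bank": 0.9, "earnings": 0.7},
    "Sports": {"game": 1.0, "team": 0.9, "season": 0.5},
}
latest_story = {"bank": 0.7, "earnings": 0.5, "minister": 0.2}
category = categorize(latest_story, CATEGORIES)  # "Finance"
```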

Cluster summarization
Problem: given a set of documents, write a
sentence summarizing them.

Difficult problem. See references in Newsblaster
paper, and more recent techniques.

Is Newsblaster really a filter?

After all, it shows all the news...

Differences with Google News?

Personalization!
Not every person needs to see the same news.

Filter design problem


Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)

Define
r(S,U,{P},{B}) in [0, 1]

relevance of story S to user U
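As a shape for discussion, the formal definition can be written as a function signature. The scoring rule below (fraction of story terms matching the user's interests) is a deliberately naive placeholder, and every name in it is hypothetical; it shows the inputs and the [0, 1] range, not a recommended filter.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    interests: set = field(default_factory=set)   # stated preferences
    history: list = field(default_factory=list)   # stories already seen

def relevance(story_terms, user, previous_scores=None, background=None):
    """Toy r(S, U, {P}, {B}) returning a score in [0, 1].

    The score is the fraction of the story's distinct terms that match the
    user's interests. previous_scores ({P}) and background ({B}) are accepted
    to mirror the formal signature but are unused in this sketch.
    """
    terms = set(story_terms)
    if not terms:
        return 0.0
    return len(terms & user.interests) / len(terms)

reader = User(interests={"budget", "senate", "climate"})
score = relevance(["senate", "budget", "vote", "delayed"], reader)  # 0.5
```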

What makes a filtering algorithm "good"?

Editors
Editors are filters. They decide what stories to run, and how
prominent to make them.

How do they choose?

The Echo Chamber


[Echo chambers are] those Internet spaces where like-minded
people listen only to those people who already agree with them.
...
While most of us had assumed that the Internet would increase
the diversity of opinion, the echo chamber meme says the Net
encourages groups to form that increase the homogeneity of
belief. This isn't simply a factual argument about the topography
carved by traffic and links. A "tut, tut" has been appended: "See,
you Web idealists have been shown up - humankind's social
nature sucks, just as we always told you!"

- David Weinberger, Is there an echo in here?

Graph of political book sales during 2008 U.S. election, by orgnet.org


From Amazon "users who bought X also bought Y" data.

Retweet network of political tweets.

From Conover et al., Political Polarization on Twitter

Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli
(Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks

The Filter Bubble


What people care about politically, and what they're motivated
to do something about, is a function of what they know about
and what they see in their media. ... People see something about
the deficit on the news, and they say, "Oh, the deficit is the big
problem." If they see something about the environment, they
say the environment is a big problem.

This creates this kind of a feedback loop in which your media
influences your preferences and your choices; your choices
influence your media; and you really can go down a long and
narrow path, rather than actually seeing the whole set of issues
in front of us.

- Eli Pariser,
How do we recreate a front-page ethos for a digital world?

The (Algorithmic) Filter Bubble




If we try to present stories that the user will want to
click on... do we end up only telling people what they
want to hear?

If an algorithm only shows us things our friends like,
will we ever see anything that challenges us?

Filter design problem, restated


When should a user see a story?

Aspects to this question:
normative
  personal: what I want
  societal: emergent group effects
UI
  how do I tell the computer what I want?
technical
  constrained by algorithmic possibility
economic
  cheap enough to deploy widely

Information diet
The holy grail in this model, as far as I'm
concerned, would be a Firefox plugin that
would passively watch your websurfing
behavior and characterize your personal
information consumption. Over the course of a
week, it might let you know that you hadn't
encountered any news about Latin America, or
remind you that a full 40% of the pages you
read had to do with Sarah Palin. It wouldn't
necessarily prescribe changes in your behavior,
simply help you monitor your own
consumption in the hopes that you might make
changes.

- Ethan Zuckerman,
Playing the Internet with PMOG
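The reporting half of that idea is easy to sketch, assuming each page read has already been given a topic label. How a plugin would actually label pages is the hard part and is not addressed here; `diet_report`, the watchlist, and the sample week are hypothetical.

```python
from collections import Counter

def diet_report(page_topics, watchlist):
    """Summarize reading by topic.

    page_topics: one topic label per page read (how pages get labeled is
    outside this sketch). Returns (share of pages per topic, watched
    topics that never appeared).
    """
    counts = Counter(page_topics)
    total = len(page_topics)
    shares = {topic: n / total for topic, n in counts.items()} if total else {}
    missing = [topic for topic in watchlist if topic not in counts]
    return shares, missing

week = ["Sarah Palin"] * 4 + ["Sports"] * 3 + ["Finance"] * 3
shares, missing = diet_report(week, watchlist=["Latin America", "Finance"])
```

For this made-up week, the report shows a 40% share for one topic and flags Latin America as never encountered, matching the two examples in the quote.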