Computational Journalism
Columbia Journalism School
Week 2: Filtering Algorithms
[Diagrams: CS, data, and users, connected by reporting and filtering; rows of x's mark stories not covered.]
Each day, the Associated Press publishes:
Both together:
Collaborative topic models
Newsblaster
System Description
[Newsblaster pipeline: Scrape → Cluster events → Cluster topics → Summarize]
(At least it’s simple. This was 2002. How often does this work
now?)
Text extraction from HTML
Now there are multiple services/APIs to do this, e.g. readability.com.
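A crude baseline, sketched here in Python with BeautifulSoup (my choice of library, not necessarily what those services use): strip script/style and other boilerplate-heavy tags, keep the visible text.

from bs4 import BeautifulSoup

def extract_text(html):
    # drop tags that mostly carry boilerplate, then keep the visible text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

Dedicated article extractors (readability-style algorithms) do much better by scoring text density per DOM node.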
Cluster Events
Surprise!
• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means (see the sketch after this list)
• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
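As a concrete instance of the partitioning approach, a minimal K-means sketch over TF-IDF vectors (scikit-learn is an assumption here, and the sample documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Stocks fell as markets reacted to the interest rate hike.",
    "Markets rallied after the interest rate decision.",
    "The team won the championship game in overtime.",
    "Fans celebrated after the team won the final game.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents x terms
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # expect the two finance stories in one cluster, the two sports stories in the other

K-means keeps reassigning documents and recomputing centroids until the clusters stop changing.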
But news is an on-line problem...
Articles arrive one at a time, and must be clustered
immediately.
Greedy algorithm.
Single pass clustering
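A hedged sketch of this greedy, single-pass approach (the threshold, vectorizer, and centroid update are illustrative choices, not Newsblaster's actual settings): compare each arriving article to existing cluster centroids; join the closest cluster if it is similar enough, otherwise start a new one.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# hashing gives a fixed feature space, so nothing needs re-fitting as articles stream in
vec = HashingVectorizer(n_features=2**16, stop_words="english", norm="l2")
THRESHOLD = 0.3   # assumed cosine-similarity cutoff

def single_pass_cluster(articles):
    centroids, sizes, labels = [], [], []
    for text in articles:
        x = vec.transform([text]).toarray()[0]
        sims = [c @ x / (np.linalg.norm(c) + 1e-12) for c in centroids]
        if sims and max(sims) >= THRESHOLD:
            j = int(np.argmax(sims))
            centroids[j] = (centroids[j] * sizes[j] + x) / (sizes[j] + 1)   # running mean
            sizes[j] += 1
        else:
            j = len(centroids)   # greedy: earlier decisions are never revisited
            centroids.append(x)
            sizes.append(1)
        labels.append(j)
    return labels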
Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.
“finance” category
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
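One naive extractive baseline (not Newsblaster's actual multi-document summarizer, which is far more sophisticated): pick the sentence closest to the TF-IDF centroid of the cluster.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def centroid_sentence(docs):
    # split crudely into sentences, embed them, and return the one nearest the centroid
    sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences).toarray()
    centroid = X.mean(axis=0)
    sims = X @ centroid / (np.linalg.norm(X, axis=1) * np.linalg.norm(centroid) + 1e-12)
    return sentences[int(np.argmax(sims))]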
V = W × H
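If this slide is the matrix-factorization view of topic modeling (my reading of it), then V is the document-term matrix, W the document-topic weights, and H the topic-word matrix. A sketch with scikit-learn's NMF, on made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["markets fell on interest rate fears", "the central bank raised rates",
        "the team won the final", "fans cheered the championship win"]
V = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents x terms
model = NMF(n_components=2, random_state=0)
W = model.fit_transform(V)   # documents x topics
H = model.components_        # topics x terms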
LDA models a topic as a distribution over all the words in the corpus.
In each topic, some words are more likely, some are less likely.
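Concretely, a topic is just a probability vector over the whole vocabulary; a made-up "finance"-flavored topic might look like this (illustrative numbers only):

finance_topic = {
    "bank": 0.08, "rates": 0.06, "market": 0.05, "stocks": 0.04,   # likely words
    "team": 0.001, "goal": 0.0005,                                 # unlikely words
    # ...probabilities over the full vocabulary sum to 1
}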
LDA Plate Notation
[Plate diagram: concentration parameters; topics in doc; topic for each word; word in doc; words in topics.]
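// count how often each word is currently assigned to each topic,
// then normalize to get per-topic word distributions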
for d=1..n
    for i=1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1
for j=1..k
    normalize topic_words[j]
topics -> doc_topics
doc_topics[*][*] = a
for d=1..n
    for i=1..len[d]
        doc_topics[d][topics[d][i]] += 1
for d=1..n
    normalize doc_topics[d]
Update topics
// for each word in document, sample a new topic
for d=1..n
    for i=1..len[d]
        w = word[d][i]
        for t=1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        normalize p
        topics[d][i] = sample from p
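A runnable translation of the pseudocode above into Python/NumPy, as a hedged sketch: it follows the slides' recount-then-resample loop (a simplified Gibbs-style sampler, not full collapsed Gibbs), and the smoothing values, iteration count, and names are my assumptions.

import numpy as np

def lda_sample(docs, k, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    # docs: list of documents, each a list of word ids in [0, vocab_size)
    rng = np.random.default_rng(seed)
    topics = [rng.integers(0, k, size=len(d)) for d in docs]   # random initial assignments

    for _ in range(n_iters):
        # topic_words[t][w]: how often word w is assigned to topic t (beta smoothing)
        topic_words = np.full((k, vocab_size), beta)
        # doc_topics[d][t]: how often topic t appears in doc d (alpha smoothing)
        doc_topics = np.full((len(docs), k), alpha)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                topic_words[topics[d][i], w] += 1
                doc_topics[d, topics[d][i]] += 1
        topic_words /= topic_words.sum(axis=1, keepdims=True)   # normalize each topic
        doc_topics /= doc_topics.sum(axis=1, keepdims=True)     # normalize each doc

        # for each word in each document, sample a new topic:
        # p[t] proportional to doc_topics[d][t] * topic_words[t][w]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                p = doc_topics[d] * topic_words[:, w]
                topics[d][i] = rng.choice(k, p=p / p.sum())

    return doc_topics, topic_words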
Upvotes minus downvotes, plus time decay.
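One concrete, widely cited version of this idea is the old open-sourced Reddit story-ranking ("hot") function, roughly as below; treat the constants as historical detail rather than gospel. Newer stories get a larger time term, so older stories effectively decay unless they keep collecting votes.

from math import log10

def hot(ups, downs, timestamp):
    # net score on a log scale, plus a time term (timestamp in Unix seconds)
    s = ups - downs
    order = log10(max(abs(s), 1))
    sign = 1 if s > 0 else (-1 if s < 0 else 0)
    seconds = timestamp - 1134028003   # epoch offset used in the open-sourced code
    return round(sign * order + seconds / 45000, 7)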
Reddit Comment Ranking (new)
[Worked example comparing raw upvote proportions: N = 16 votes with v = 11 up gives p = 11/16 = 0.6875; n = 3 votes with v′ = 1 up gives p′ = 1/3 ≈ 0.333. Other pairs shown: p′ = 0.75 vs. p = 0.1875, and p′ = 0.333 vs. p = 0.6875.]
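The newer comment ranking sorts by the lower bound of the Wilson score confidence interval on the upvote fraction, which penalizes comments with only a handful of votes even when the raw proportion looks good. A sketch:

from math import sqrt

def wilson_lower_bound(ups, n, z=1.96):
    # lower bound of the Wilson score interval for the true upvote fraction,
    # at roughly 95% confidence (z = 1.96)
    if n == 0:
        return 0.0
    phat = ups / n
    return ((phat + z * z / (2 * n)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
            / (1 + z * z / n))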
Cosine similarity!
Other distance measures
“adjusted cosine similarity”
[Notation residue: r = a user's rating of an item; sums run over users u and items i; λ_u controls the variation in per-user topics (topics for user u).]
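For reference, adjusted cosine similarity mean-centers each user's ratings before taking the cosine between two items' rating columns, so habitually generous and habitually harsh raters are put on the same scale. A sketch (the ratings layout is my assumption):

import numpy as np

def adjusted_cosine(ratings, i, j):
    # ratings: users x items array with np.nan where a user has not rated an item
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    centered = ratings - user_means                               # subtract each user's mean rating
    both = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])    # users who rated both items
    a, b = centered[both, i], centered[both, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0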
New York Times recommender
Combining collaborative filtering
and topic modeling
Content modeling - LDA
[LDA plate diagram, as above: concentration parameters; topics in doc; topic for each word; word in doc; words in topics.]
[Plate diagram, collaborative part: user ratings/selections of docs (collaborative signal); weight of each doc; topics in doc; variation in per-user topics; topics for each user.]
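Schematically (assuming a collaborative-topic-regression-style model in the spirit of Wang & Blei 2011, which, as I understand it, the NYT recommender builds on): an article is represented by its LDA topic mixture plus a per-article offset learned from reader behavior, and a user's predicted interest is the dot product with their topic-preference vector.

import numpy as np

def predicted_interest(user_prefs, doc_topics, doc_offset):
    # user_prefs, doc_topics, doc_offset: vectors of length k (number of topics)
    return float(np.dot(user_prefs, doc_topics + doc_offset))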
[Comparison: content only vs. content + social.]