
Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Filtering Algorithms
September 15, 2017


Journalism as a cycle

[Diagram: the journalism cycle: data → reporting → filtering → effects, with computer science (CS) applied at each stage. Filtering determines which stories reach each user; some stories (marked x) are never covered.]
Each day, the Associated Press publishes:

~10,000 text stories


~3,000 photographs
~500 videos
+ radio, interactive…
More video on YouTube than produced by TV networks
during entire 20th century.
Google now indexes more web pages
than there are people in the world

400,000,000 tweets per day

estimated 130,000,000 books ever published


10,000 legally-required reports filed by U.S. public
companies every day
All New York Times articles ever = 0.06 terabytes
(13 million stories, ~5 KB per story)
It’s not information overload, it’s filter failure
- Clay Shirky
This class
Filtering by content
Newsblaster
Topic models

Filtering by user interaction


Reddit comment ranking
User-item recommendation

Both together:
Collaborative topic models
Newsblaster
System Description

Pipeline: Scrape → Cluster events → Cluster topics → Summarize
Scrape

Handcrafted list of source URLs (news front pages); links followed to depth 4.

Then extract the text of each article.


Text extraction from HTML
Ideal world: HTML5 “article” tags

“The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication.”
- W3C Specification
Text extraction from HTML
Newsblaster paper:

“For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted.”

(At least it’s simple. This was 2002. How often does this work
now?)
Text extraction from HTML
Now there are multiple services/APIs to do this, e.g. readability.com
Cluster Events

Surprise!

• encode articles into feature vectors
• cosine distance function
• hierarchical clustering algorithm (sketch below)
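A minimal sketch of that pipeline using scikit-learn (the library choice, the distance threshold, and the toy articles are illustrative assumptions, not Newsblaster's actual code):

# Encode articles as TF-IDF feature vectors, then bottom-up (agglomerative)
# clustering with cosine distance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

articles = [
    "Senate passes budget bill after late-night session",
    "Budget bill clears Senate in narrow vote",
    "Hurricane weakens as it moves up the coast",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles).toarray()

clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,   # tune per corpus
    metric="cosine",          # called 'affinity' in older scikit-learn versions
    linkage="average",
)
print(clusterer.fit_predict(X))   # e.g. [0, 0, 1]: the two budget stories share an event cluster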
Different clustering algorithms

• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. single-link (MIN) and complete-link (MAX) merge criteria

• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
But news is an on-line problem...
Articles arrive one at a time, and must be clustered
immediately.

Can’t look forward in time, can’t go back and reassign.

Greedy algorithm.
Single pass clustering

put first story in its own cluster


repeat
get next story S
look for closest cluster C with distance(S, C) < T
if found
put S in C
else
put S in new cluster
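A runnable Python version of this greedy single-pass loop might look like the following (the running-mean centroid and the cosine-distance threshold T are assumptions for illustration):

import numpy as np

def single_pass_cluster(stories, T=0.5):
    """Greedy online clustering: put each arriving story vector in the
    closest existing cluster if it is within distance T, else start a new one."""
    clusters, centroids = [], []
    for idx, s in enumerate(stories):
        s = np.asarray(s, dtype=float)
        best, best_dist = None, T
        for c, centroid in enumerate(centroids):
            # cosine distance = 1 - cosine similarity
            dist = 1 - s @ centroid / (np.linalg.norm(s) * np.linalg.norm(centroid))
            if dist < best_dist:
                best, best_dist = c, dist
        if best is None:
            clusters.append([idx])
            centroids.append(s)
        else:
            clusters[best].append(idx)
            n = len(clusters[best])
            centroids[best] += (s - centroids[best]) / n   # update running mean
    return clusters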
Now sort events into categories

Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?


TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Diagram: the latest story plotted in TF-IDF space between the “world” and “finance” category coordinates; it is assigned to whichever is closer.]
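A sketch of that primitive operation; the category coordinates and story vector below are made up, and in the real system they would live in the full TF-IDF term space:

import numpy as np

def closest_category(story_vec, category_vecs):
    """Return the category whose pre-assigned TF-IDF coordinate is closest
    to the story (highest cosine similarity)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(category_vecs, key=lambda name: cos(story_vec, category_vecs[name]))

categories = {
    "world":   np.array([0.9, 0.1, 0.0]),
    "finance": np.array([0.1, 0.9, 0.2]),
}
latest_story = np.array([0.2, 0.8, 0.1])
print(closest_category(latest_story, categories))   # finance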
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.

Active research area. See references in Newsblaster paper, and more recent techniques.
Is Newsblaster really a filter?
After all, it shows “all” the news...

Differences with Google News?


Topic Modeling – Matrix Techniques
Problem Statement

Can the computer tell us the “topics” in a document set?


Can the computer organize the documents by “topic”?

Note: TF-IDF tells us the topics of a single document, but here we want topics of an entire document set.
Simplest possible technique
Sum TF-IDF scores for each word across entire document
set, choose top ranking words.

This is how Overview generates cluster descriptions.
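A rough sketch of this technique with scikit-learn (Overview's actual implementation may differ; the toy documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = [
    "city council votes to expand transit budget",
    "transit budget expansion approved by council",
    "mayor backs transit spending plan",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)                  # docs x terms
scores = np.asarray(X.sum(axis=0)).ravel()   # sum TF-IDF of each word over the whole set
terms = np.array(vec.get_feature_names_out())
print(terms[np.argsort(scores)[::-1][:5]])   # top-ranking words describe the set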


Topic Modeling Algorithms
Basic idea: reduce dimensionality of document vector
space, so each dimension is a topic.

Each document is then a vector of topic weights. We want to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Many variants: LSI, PLSI, LDA, NMF


Matrix Factorization

Approximate the term-document matrix V as the product of two lower-rank matrices.

V ≈ W × H

V: m docs × n terms
W: m docs × r "topics"
H: r "topics" × n terms


Matrix Factorization
A "topic" is a group of words that occur together.

[Diagram: a row of H gives the words in a topic; a row of W gives the topics in a document.]

Non-negative Matrix Factorization
All elements of document coordinate matrix W and topic
matrix H must be >= 0

Simple iterative algorithm to compute.

Still have to choose number of topics r
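A minimal NMF topic-model sketch with scikit-learn; the corpus, the choice r = 2, and the initialization are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

docs = [
    "senate passes budget bill",
    "budget vote splits senate",
    "striker scores twice in cup final",
    "coach praises defense after final",
]

vec = TfidfVectorizer(stop_words="english")
V = vec.fit_transform(docs)              # m docs x n terms

r = 2                                     # number of topics, chosen by hand
nmf = NMF(n_components=r, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)                  # m docs x r: topic weights per document
H = nmf.components_                       # r x n terms: word weights per topic

terms = vec.get_feature_names_out()
for t in range(r):
    top = np.argsort(H[t])[::-1][:4]
    print("topic", t, ":", " ".join(terms[top]))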


Probabilistic Topic Modeling
Latent Dirichlet Allocation
Imagine that each document is written by someone going
through the following process:

1. For each doc d, choose mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose word from p(w|z)

A document has a distribution of topics.


Each topic is a distribution of words.
LDA tries to find these two sets of distributions.
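A toy simulation of that generative story (the vocabulary, number of topics, and concentration values are made-up for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["election", "senate", "budget", "goal", "coach", "match"]
K, alpha, beta = 2, 0.5, 0.1             # topics and concentration parameters (assumed)

# Each topic is a distribution over all words in the vocabulary: p(w|z)
topic_word = rng.dirichlet([beta] * len(vocab), size=K)

def generate_doc(n_words=8):
    doc_topics = rng.dirichlet([alpha] * K)          # 1. p(z|d): this doc's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=doc_topics)              # 2. pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # 3. pick the word from that topic
        words.append(vocab[w])
    return words

print(generate_doc())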
"Documents"

LDA models each document as a distribution over topics. Each word belongs to a single topic.
"Topics"

LDA models a topic as a distribution over all the words in the corpus.
In each topic, some words are more likely, some are less likely.
LDA Plate Notation
[Plate diagram: doc-topic concentration parameter → topics in doc → topic for each word → word in doc; topic-word concentration parameter → words in topics. Plates: N words in doc, D docs, K topics.]
Computing LDA
Inputs:
word[d][i] document words
k # topics
a doc topic concentration
b topic word concentration
Also:
n # docs
len[d] # words in document
v vocabulary size
Computing LDA
Outputs:
topics[n][i]        doc/word topic assignments
topic_words[k][v]   topic words dist
doc_topics[n][k]    document topics dist
topics -> topic_words
topic_words[*][*] = b

for d=1..n
for i=1..len[d]
topic_words[topics[d][i]][word[d][i]] += 1

for j=1..k
normalize topic_words[j]
topics -> doc_topics
doc_topics[*][*] = a

for d=1..n
for i=1..len[d]
doc_topics[d][topics[d][i]] +=1

for d=1..n
normalize doc_topics[d]
Update topics
// for each word in document, sample a new topic
for d=1..n
for i=1..len[d]
w = word[d][i]
for t=1..k
p[t] = doc_topics[d][t] * topic_words[t][w]

topics[d][i] = sample from p[t]
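In practice one usually calls a library rather than hand-rolling the sampler. A scikit-learn sketch (its variational method differs from the sampling loop above; the corpus and parameter values are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

docs = [
    "senate passes budget bill",
    "budget vote splits senate",
    "striker scores twice in cup final",
    "coach praises defense after final",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)              # word counts, docs x vocabulary

k = 2
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=0.5,                 # a: doc topic concentration
    topic_word_prior=0.1,                # b: topic word concentration
    random_state=0,
)
doc_topics = lda.fit_transform(X)        # docs x k: document topics dist
topic_words = lda.components_            # k x vocab: (unnormalized) topic words dist

terms = vec.get_feature_names_out()
for t in range(k):
    top = np.argsort(topic_words[t])[::-1][:4]
    print("topic", t, ":", " ".join(terms[top]))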


Dimensionality reduction
Output of NMF and LDA is a vector of much lower
dimension for each document. ("Document
coordinates in topic space.")

Dimensions are “concepts” or “topics” instead of words.

Can measure cosine distance, cluster, etc. in this new space.
Comment Ranking
Filtering Comments

Thousands of comments, what are the “good” ones?


Comment voting

Problem: putting comments with most votes at top doesn’t work. Why?
Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay.
Reddit Comment Ranking (new)

Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes.

Example: N = 16, v = 11, p = 11/16 = 0.6875
Reddit Comment Ranking

Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n.

Example: n = 3, v’ = 1, p’ = 1/3 = 0.333
Reddit Comment Ranking

Example: one comment may show p’ = 0.75 while its true p = 0.1875; another may show p’ = 0.333 while its true p = 0.6875.

Limited sampling can rank comments incorrectly when we don’t have enough data.
Random error in sampling
If we observe an upvote proportion p’ from n random users, what is the distribution of the true proportion p?

[Plot: distribution of p’ when p = 0.5]

Confidence interval
There is 1 - α probability that the true value p lies within the central region (when sampled assuming p = p’).
Rank comments by lower bound
of confidence interval
Analytic solution for the confidence interval, known as the “Wilson score”.

p’ = observed proportion of upvotes
n = how many people voted
z_α = how certain we want to be before we assume that p’ is “close” to the true p
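The usual form of the Wilson lower bound is

lower = ( p’ + z^2/(2n) - z * sqrt( p’(1 - p’)/n + z^2/(4n^2) ) ) / ( 1 + z^2/n )

A small Python version (the example counts are illustrative):

from math import sqrt

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson confidence interval for the true upvote
    proportion; z = 1.96 corresponds to a 95% interval."""
    if n == 0:
        return 0.0
    p = upvotes / n
    center = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / (1 + z * z / n)

# Same observed proportion, very different confidence:
print(wilson_lower_bound(3, 4))      # ~0.30
print(wilson_lower_bound(300, 400))  # ~0.71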
User-item Recommendation
User-item matrix

Stores a “rating” of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...
User-item matrix
• No content analysis. We know nothing about what is “in” each
item.
• Typically very sparse – a user hasn’t watched even 1% of all
movies.
• Filtering problem is guessing “unknown” entry in matrix. High
guessed values are things user would want to see.
Filtering process
How to guess unknown rating?

Basic idea: suggest “similar” items.

Similar items are rated in a similar way by many different users.

Remember, “rating” could be a click, a like, a purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
Similar items
Item similarity

Cosine similarity!
Other distance measures
“adjusted cosine similarity”

Subtracts average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”).
Generating a recommendation

Weighted average of item ratings by their similarity.


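A toy item-item sketch putting the last few slides together (the rating matrix, the mean-centering for adjusted cosine, and the prediction rule are illustrative assumptions):

import numpy as np

# rows = users, columns = items; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def adjusted(R):
    """Adjusted cosine: subtract each user's mean rating over rated items."""
    A = R.copy()
    for u in range(R.shape[0]):
        rated = R[u] > 0
        A[u, rated] -= R[u, rated].mean()
    A[R == 0] = 0.0
    return A

def item_similarity(A):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(A, axis=0) + 1e-12
    return (A.T @ A) / np.outer(norms, norms)

def predict(R, S, user, item):
    """Similarity-weighted average of the user's ratings of other items."""
    rated = np.where(R[user] > 0)[0]
    w = S[item, rated]
    return float(w @ R[user, rated] / (np.abs(w).sum() + 1e-12))

S = item_similarity(adjusted(R))
print(predict(R, S, user=0, item=2))   # guessed rating of item 2 by user 0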
Matrix factorization recommender

Note: only sum over observed ratings r_ij.
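The objective here is typically of the form: minimize, over user vectors u_i and item vectors v_j, the sum over observed ratings of (r_ij - u_i·v_j)^2 plus regularization terms λ_u‖u_i‖^2 and λ_v‖v_j‖^2 (matching the λ_u, λ_v in the plate model below). A stochastic-gradient sketch under that assumption:

import numpy as np

def factorize(ratings, n_users, n_items, r=2, lr=0.01, lam=0.1, epochs=500):
    """SGD on: sum over observed (i, j) of (r_ij - u_i.v_j)^2 + lam*(|u_i|^2 + |v_j|^2)."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, r))   # user topic vectors
    V = rng.normal(scale=0.1, size=(n_items, r))   # item topic vectors
    for _ in range(epochs):
        for i, j, r_ij in ratings:                 # only observed ratings
            err = r_ij - U[i] @ V[j]
            ui = U[i].copy()
            U[i] += lr * (err * V[j] - lam * U[i])
            V[j] += lr * (err * ui - lam * V[j])
    return U, V

# (user, item, rating) triples; unknown entries are simply absent
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
U, V = factorize(ratings, n_users=3, n_items=3)
print(U[0] @ V[2])   # guessed rating of item 2 by user 0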
Matrix factorization plate model
[Plate diagram: λ_v → item topic vector v (plate over j items); λ_u → user topic vector u (plate over i users); the observed rating r of item j by user i depends on both. λ_u and λ_v control the variation in item and user topics.]
New York Times recommender
Combining collaborative filtering
and topic modeling
Content modeling - LDA
[Plate diagram: the same LDA model as before: concentration parameters, topics in doc, topic for each word, word in doc, words in topics; plates over N words in doc, D docs, K topics.]
Collaborative Topic Modeling
[Plate diagram: the LDA content model (topic concentration, topics in doc, topic for word, word in doc, K topics) is joined with the collaborative side: each user’s topic weights (with per-user variation) combine with a doc’s topics to produce that user’s rating of the doc.]
[Figure: “content only” vs. “content + social”.]
