Shaomei Wu
sw475@cornell.edu
Jake M. Hofman
hofman@yahoo-inc.com
Winter A. Mason
winteram@yahoo-inc.com
Duncan J. Watts
djw@yahoo-inc.com
ABSTRACT
We study several longstanding questions in media communications research, in the context of the
microblogging service Twitter, regarding the production, flow, and consumption of information.
To do so, we exploit a recently introduced feature of Twitter known as "lists" to distinguish
between elite users (by which we mean celebrities, bloggers, and representatives of media outlets
and other formal organizations) and ordinary users. Based on this classification, we find a striking
concentration of attention on Twitter, in that roughly 50% of URLs consumed are generated by just
20K elite users, where the media produces the most information, but celebrities are the most
followed. We also find significant homophily within categories: celebrities listen to celebrities,
while bloggers listen to bloggers, etc.; however, bloggers in general rebroadcast more information
than the other categories. Next we re-examine the classical "two-step flow" theory of
communications, finding considerable support for it on Twitter. Third, we find that URLs broadcast
by different categories of users or containing different types of content exhibit systematically
different lifespans. And finally, we examine the attention paid by the different user categories to
different news topics.
General Terms
two-step flow, communications, classification
Keywords
Communication networks, Twitter, information flow
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution
of these papers is limited to classroom use, and personal use by others.
WWW 2011, March 28-April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0637-9/11/03.
1. INTRODUCTION
A longstanding objective of media communications research is encapsulated by what is known as
Lasswell's maxim: "who says what to whom in what channel with what effect" [12], so named for one
of the pioneers of the field, Harold Lasswell. Although simple to state, Lasswell's maxim has
proven difficult to answer in the more than 60 years since he stated it, in part because it is
generally difficult to observe information flows in large populations, and in part because
different channels have very different attributes and effects. As a result, theories of
communications have tended to focus either on mass communication, defined as one-way message
transmissions from one source to a large, relatively undifferentiated and anonymous audience, or on
interpersonal communication, meaning a two-way message exchange between two or more
individuals [16].
Correspondingly, debates among communication theorists have tended to revolve around the relative
importance of these two putative modes of communication. For example, whereas early theories such
as the "hypodermic needle" model posited that mass media exerted direct and relatively strong
effects on public opinion, mid-century researchers [13, 9, 14, 4] argued that the mass media
influenced the public only indirectly, via what they called a two-step flow of communications,
where the critical intermediate layer was occupied by a category of media-savvy individuals called
opinion leaders. The resulting "limited effects" paradigm was subsequently challenged by a new
generation of researchers [6], who claimed that the real importance of the mass media lay in its
ability to set the agenda of public discourse. But in recent years rising public skepticism of
mass media, along with changes in media and communication technology, have tilted conventional
academic wisdom once more in favor of interpersonal communication, which some identify as a
"new era" of minimal effects [2].
Recent changes in technology, however, have increasingly undermined the validity of the mass vs.
interpersonal dichotomy itself. On the one hand, over the past few decades mass communication has
experienced a proliferation of new channels, including cable television, satellite radio,
specialist book and magazine publishers, and of course an array of web-based media such as
sponsored blogs, online communities, and social news sites. Correspondingly, the traditional mass
audience once associated with, say, network television has fragmented into many smaller audiences,
each of which increasingly selects the information to which it is exposed, and in some cases
generates the information itself [15]. Meanwhile, in the opposite direction interpersonal
communication has become increasingly amplified through personal blogs, email lists, and social
networking sites to
2. RELATED WORK
3.
3.1
from http://dev.twitter.com/doc/get/statuses/firehose
Naturally, this restriction also has downsides, in particular
that some users may be more likely to include URLs in their
tweets than others, and thus will appear to be relatively
more active and/or have more impact than if we were instead
to consider all tweets. For our purposes, however, we believe
that the practical advantages of the restriction outweigh the
potential for bias.
3
3.3.1
The first method for identifying elite users employed snowball sampling. For each category, we
chose a number u0 of seed users that were highly representative of the desired category and
appeared on many category-related lists. For each of the four categories above, the following
seeds were chosen:
Celebrities: Barack Obama, Lady Gaga, Paris Hilton
Media: CNN, New York Times
Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
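As a rough illustration of the snowball procedure, the sketch below alternates between users and lists, starting from the seed users u0 and producing the sequence u0, l0, u1, l1, and so on. The in-memory dictionaries `lists_of_user` and `members_of_list`, the keyword filter on list names, and the fixed number of rounds are all assumptions made for this example; they stand in for Twitter API lookups and are not taken from the original crawl.

```python
from typing import Dict, Iterable, Set, Tuple


def snowball_sample(
    seeds: Iterable[str],
    lists_of_user: Dict[str, Set[str]],
    members_of_list: Dict[str, Set[str]],
    keywords: Iterable[str],
    rounds: int = 2,
) -> Tuple[Set[str], Set[str]]:
    """Alternate user -> list -> user expansion starting from the seed users (u0)."""
    lowered = [k.lower() for k in keywords]
    users: Set[str] = set(seeds)      # u0
    lists: Set[str] = set()
    frontier: Set[str] = set(users)   # users added in the previous round
    for _ in range(rounds):
        # l_i: unseen lists that contain a frontier user and match a keyword
        new_lists = {
            name
            for user in frontier
            for name in lists_of_user.get(user, set())
            if name not in lists and any(k in name.lower() for k in lowered)
        }
        lists |= new_lists
        # u_{i+1}: members of those lists that have not been sampled yet
        frontier = {
            member
            for name in new_lists
            for member in members_of_list.get(name, set())
        } - users
        users |= frontier
        if not frontier:
            break  # nothing new to expand
    return users, lists
```

Restricting the expansion to keyword-matching lists is what keeps the sample within a category; without it the crawl would drift toward whatever lists the seed users happen to share.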
4
The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each
API call. Under the modest assumption of 40M users, where each user is included on at most 20
lists, this would require roughly 11 weeks. Clearly this time could be reduced by deploying
multiple accounts, but it also likely underestimates the real time quite significantly, as many
users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
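The estimate in this footnote follows from simple arithmetic, reproduced in the sketch below. The function name and its parameters are illustrative; the figures (20K calls per hour, 20 lists per call, 40M users, at most 20 lists per user) are those quoted in the footnote.

```python
import math


def crawl_hours(n_users: int, lists_per_user: int,
                lists_per_call: int = 20, calls_per_hour: int = 20_000) -> float:
    """Hours needed to fetch every list membership under the stated rate limit."""
    # Each user needs enough calls to page through all of their lists.
    calls_per_user = math.ceil(lists_per_user / lists_per_call)
    return n_users * calls_per_user / calls_per_hour


hours = crawl_hours(40_000_000, 20)   # 40M calls at 20K calls per hour
weeks = hours / (24 * 7)              # between 11 and 12 weeks
```

A single user on 140,000 lists needs 7,000 calls alone, which is why the estimate is a lower bound.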
[Figure 1: diagram with nodes labeled u0, l0, u1, l1, u2, l2]
Method       Snowball Sample               Activity Sample
             # of users    % of users      # of users    % of users
celeb            82,770         15.8%          14,778         13.0%
media           216,010         41.2%          40,186         35.3%
org              97,853         18.7%          14,891         13.1%
blog            127,483         24.3%          43,830         38.6%
total           524,116          100%         113,685          100%
all lists associated with all users who tweeted at least once every week for our entire
observation period.
This activity-based sampling method is also clearly biased towards users who are consistently
active. Importantly, however, the bias is likely to be quite different from any introduced by the
snowball sample; despite these differences, the qualitative results that follow are similar for
both samples, providing evidence that our findings are not artifacts of the sampling procedures.
This method initially yielded 750K users and 5M lists; however, after pruning the lists to those
that contained at least one of the keywords above, and assigning users to unique categories (as
described above), we obtained a refined sample of 113,685 users, where Table 1 reports the number
of users assigned to each category. We note that the number of lists obtained by the activity
sampling method is considerably smaller than that obtained by the snowball sample, and that
bloggers are more heavily represented among the activity sample at the expense of the other three
categories, consistent with our claim that the two methods introduce different biases.
Interestingly, however, 97,614 of the activity sample, or 85%, also appear in the snowball sample,
suggesting that the two sampling methods identify similar populations of elite users, as indeed we
confirm in the next section.
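The activity criterion (at least one tweet in every week of the observation period) can be sketched as follows. The input format, a sequence of (user, timestamp) pairs, and the use of fixed 7-day windows measured from the start of the period are assumptions for illustration; the text does not say how week boundaries were drawn.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple

WEEK = timedelta(weeks=1)


def consistently_active_users(
    tweets: Iterable[Tuple[str, datetime]],
    start: datetime,
    end: datetime,
) -> Set[str]:
    """Return users with at least one tweet in every 7-day window of [start, end)."""
    n_weeks = math.ceil((end - start) / WEEK)
    weeks_seen: Dict[str, Set[int]] = defaultdict(set)
    for user, posted_at in tweets:
        if start <= posted_at < end:
            # Index of the 7-day window this tweet falls into
            weeks_seen[user].add((posted_at - start) // WEEK)
    return {user for user, weeks in weeks_seen.items() if len(weeks) == n_weeks}
```

Because a single missed week removes a user, this filter favors accounts that post on a schedule, which is one way to read the bias toward consistently active users noted above.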
3.3.3
[Figure: four panels showing average % of friends and average % of tweets received (y-axis, 10 to 50) versus top k users (x-axis, 1000 to 10000), for the celeb, media, org, and blog categories]
followed by bloggers, organizations, and celebrities. Ordinary users originate on average only
about 6 URLs each, compared with over 1,000 for media users. In the rest of this paper, therefore,
when we talk about "celebrity," "media," "organization," and "blog," we refer to the top 5K users
drawn from the snowball sample listed as "celebrity," "media," "organization," and "blog,"
respectively.
Table 3, which shows the top 5 users in each of the four categories, suggests that the sampling
method yields results that are consistent with our objective of identifying users who are
prominent exemplars of our target categories. Among the celebrity list, for example, "aplusk" is
the handle for actor Ashton Kutcher, one of the first celebrities to embrace Twitter and still one
of the most followed users, while the remaining celebrity users (Lady Gaga, Ellen DeGeneres, Oprah
Winfrey, and Taylor Swift) are all household names. In the media category, CNN Breaking News and
the New York Times are most prominent, followed by Breaking News, Time, and Asahi, a leading
Japanese daily newspaper. Among organizations, Google, Starbucks, and Twitter are obviously large
and socially prominent corporations, while JoinRed is the charity organization started by Bono of
U2, and ollehkt is the Twitter account for KT, formerly Korea Telecom. Finally, among the blogging
category, Mashable and ProBlogger are both prominent US blogging sites, while Kibe Loco and Nao
Salvo are popular blogs in Brazil, and dooce is the blog of Heather Armstrong, a widely read
"mommy blogger" with over 1.5M followers.
4.
The results of the previous section provide qualified support for the conventional wisdom that
audiences have become increasingly fragmented. Clearly, ordinary users on Twitter are receiving
their information from many thousands of distinct sources, most of which are not traditional media
organizations: even though media outlets are by far the most active users on Twitter, only about
15% of tweets received by ordinary users are received directly from the media. Equally
interesting, however, is that in spite of this fragmentation, it remains the case that 20K elite
users, comprising less than 0.05% of the user population, attract almost 50% of all attention
within Twitter. Thus, while attention that was formerly restricted to mass media channels is now
[Figure: diagrams over the Celeb, Org, Media, and Blog categories; edge label: A retweet B]
                     # of retweets by
             Celeb      Media        Org       Blog
Celeb        4,334      1,489      1,543      5,039
Media        4,624     40,263      7,628     32,027
Org          1,570      2,539     18,937     11,175
Blog         3,710      6,382      5,762     99,818
aries? In addition, we may inquire whether these intermediaries, to the extent they exist, are
drawn from other elite categories or from ordinary users, as claimed by the two-step flow theory;
and if the latter, in what respects they differ from other ordinary users.
Before proceeding with this analysis, we note that there are two ways information can pass through
an intermediary in Twitter. The first is via retweeting, which occurs when a user explicitly
rebroadcasts a URL that he or she has received from a friend, along with an explicit
acknowledgement of the source, either using the official "retweet" functionality provided by
Twitter or by making use of an informal convention such as "RT @user" or "via @user."
Alternatively, a user may tweet a URL that has previously been posted, but without acknowledgement
of a source; in this case we assume the information was independently rediscovered and label this
a "reintroduction" of content. For the purposes of studying when a user receives information
directly from the media or indirectly through an intermediary, we treat retweets and
reintroductions equivalently. If the first occurrence of a URL in Twitter came from a media user,
but a user received the URL from another source, then that source can be considered an
intermediary, whether they are citing the source within Twitter by retweeting the URL, or
reintroducing it, having discovered the URL outside of Twitter.
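The distinction between retweets and reintroductions can be sketched as a small labeling routine. The tuple layout, the `is_official_retweet` flag, and the regular expression for the informal "RT @user" and "via @user" conventions are assumptions for this example rather than the classification code used in the study.

```python
import re
from typing import Iterable, List, Set, Tuple

# Informal attribution conventions: "RT @user" or "via @user"
ATTRIBUTION = re.compile(r"\b(?:RT|via)\s*@(\w+)", re.IGNORECASE)

Post = Tuple[str, str, str, bool]  # (user, url, text, is_official_retweet)


def label_posts(posts: Iterable[Post]) -> List[str]:
    """Label chronologically ordered posts as original, retweet, or reintroduction."""
    seen_urls: Set[str] = set()
    labels: List[str] = []
    for _user, url, text, is_official_retweet in posts:
        if is_official_retweet or ATTRIBUTION.search(text):
            labels.append("retweet")          # source explicitly acknowledged
        elif url in seen_urls:
            labels.append("reintroduction")   # URL seen before, no acknowledgement
        else:
            labels.append("original")         # first occurrence of the URL
        seen_urls.add(url)
    return labels
```

For the intermediary analysis the two non-original labels are treated alike, so a downstream step would only need to know whether a post is the first occurrence of its URL.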
To quantify the extent to which ordinary users get their information indirectly versus directly
from the media, we sampled 1M random ordinary users^6, and for each user, counted the number n of
bit.ly URLs they had received that had originated from one of our 5K media users, where of the 1M
total, 600K had received at least one such URL. For each member of this 600K subset we then
counted the number n2 of these URLs that they received via non-media friends; that is, via a
two-step flow. The average fraction n2/n = 0.46 therefore represents the proportion of
media-originated content that reaches the masses via an intermediary rather than directly. As
Figure 5 shows, however, this average is somewhat misleading. In reality, the population comprises
two types: those who receive essentially all of their media-originating information via two-step
flows and those who receive virtually all of it directly from the media. Unsurprisingly, the
former type is exposed to less total media than the latter. What is surprising, however, is that
[Figure: panels (b) and (c) comparing intermediaries with a random sample; axis labels: # users, # of opinion leaders, # of two-step recipients (logarithmic scale, 10 to 10^5)]
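The two-step fraction n2/n described above can be computed per user and then averaged over users with n > 0. In the sketch below, the timeline format (a list of (url, sender) pairs per user) is hypothetical, and one rule is an explicit assumption: a media-originated URL counts toward n2 only if no media user delivered it to that user directly.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def two_step_fraction(
    received: Iterable[Tuple[str, str]],
    media_users: Set[str],
    media_urls: Set[str],
) -> Optional[float]:
    """Return n2/n for one user, or None if the user saw no media-originated URL."""
    senders_by_url: Dict[str, Set[str]] = {}
    for url, sender in received:
        if url in media_urls:
            senders_by_url.setdefault(url, set()).add(sender)
    n = len(senders_by_url)
    if n == 0:
        return None
    # Assumption: two-step means the URL never arrived directly from a media user
    n2 = sum(1 for senders in senders_by_url.values() if not (senders & media_users))
    return n2 / n


def average_two_step_fraction(
    timelines: Dict[str, List[Tuple[str, str]]],
    media_users: Set[str],
    media_urls: Set[str],
) -> float:
    """Average n2/n over users who received at least one media-originated URL."""
    fractions = [
        f
        for f in (two_step_fraction(r, media_users, media_urls) for r in timelines.values())
        if f is not None
    ]
    if not fractions:
        raise ValueError("no user received a media-originated URL")
    return sum(fractions) / len(fractions)
```

Keeping the per-user fractions, rather than only their mean, is what would reveal a bimodal population of the kind the text describes.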
5.
The results in Section 4 demonstrate that elite users account for a substantial portion of all of
the attention on Twitter, but also show clear differences in how the attention is allocated to the
different elite categories. It is therefore interesting to consider what kinds of content are
being shared by these categories. Given the large number of URLs in our observation period (260M),
and the many different ways one can classify content (video vs. text, news vs. entertainment,
political news vs. sports news, etc.), classifying even a small fraction of URLs according to
content is an onerous task. Bakshy et al. [1], for example, used Amazon's Mechanical Turk to
classify a stratified sample of 1,000 URLs along a variety of dimensions; however, this method
does not scale well to larger sample sizes.
Instead, we restricted attention to URLs originated by the New York Times which, with over 2.5M
followers, is the most active and the second-most-followed news organization on Twitter (after CNN
Breaking News). To classify NY Times content, we exploited a convenient feature of their format,
namely that all NY Times URLs are classified in a consistent way by the section in which they
appear (e.g. U.S., World, Sports, Science, Arts, etc.).^7 Of the 6,398 New York Times bit.ly URLs
we observed, 6,370 could be successfully unshortened and assigned to one of 21 categories. Of
these, however, only 9 categories had more than 100 URLs during the observation period, one of
which (NY region) was highly specific to the New York metropolitan area; thus we focused our
attention on the remaining 8 topical categories. Figure 7 shows the proportion of URLs from each
New York Times section retweeted or reintroduced by each category. World
7
http://www.nytimes.com/year/month/day/category/title.html?ref=category
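Given the URL pattern in this footnote, the section of an unshortened NY Times URL can be read from its path. The sketch below is one way to do this with the Python standard library; the function name is hypothetical, and the assumption that the first path component after the date is the section is based only on the pattern shown here.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# /year/month/day/category/... as in the footnote's URL pattern
_DATED_PATH = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/([^/]+)/")


def nyt_section(url: str) -> Optional[str]:
    """Return the section of a dated nytimes.com article URL, or None if it does not match."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "nytimes.com" and not host.endswith(".nytimes.com"):
        return None
    match = _DATED_PATH.match(parsed.path)
    return match.group(4).lower() if match else None
```

URLs that do not follow the dated pattern return None, which would correspond to the small number of URLs that could not be assigned to a category.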
[Figure 7: proportion of URLs (y-axis, 0.00 to 0.35) by User Category (blog, celeb, media, org, other); panels: 1. World News, 2. U.S. News, 3. Business, 4. Sports, 5. Health, 6. Technology, 7. Science, 8. Arts]
[Figure labels: first observation of URL, last observation of URL]
5.2 Lifespan By Category
[Figure: domain labels: youtube.com, last.fm, amazon.com, pollpigeon.com, en.wikipedia.org, bitrebels.com, mashable.com, feedproxy.google.com, ted.com, ecademy.com, imdb.com, myspace.com, facebook.com, twitter.com, google.com, flickr.com, collegehumor.com, vimeo.com, smashingmagazine.com, 123greetings.com]
[Figure 9 panels: (a) Count, # of URLs on a logarithmic scale; (b) Percent; x-axis: lifespan (day), 0 to 70; series: other, celeb, media, org, blog]
Figure 9: (a) Count and (b) percentage of URLs initiated by 4 categories, with different lifespans
$\text{RT rate} = \dfrac{\#\text{ of RTs}}{\text{total }\#\text{ of occurrences}}$
[Figure: curves for other, celeb, media, org, and blog; y-axis 0.0 to 1.0; x-axis: lifespan (day), 0 to 70]
rediscovering the same content, consistent with our interpretation above. Second, however, for
URLs introduced by elite users, the result is somewhat the opposite; that is, they are more likely
to be retweeted than reintroduced, even for URLs that persist for weeks. Although it is
unsurprising that elite users generate more retweets than ordinary users, the size of the
difference is nevertheless striking, and suggests that in spite of the dominant result above that
content lifespan is determined to a large extent by the type of content, the source of its origin
also impacts its persistence, at least on average, a result that is consistent with previous
findings [1].
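The comparison between retweeting and reintroduction rests on the RT rate, the number of retweets of a URL divided by its total number of occurrences. A minimal sketch of that ratio follows; the input format, one (url, is_retweet) pair per appearance of a URL, is an assumption, as is counting the first posting among the occurrences.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def rt_rate_by_url(occurrences: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return, for each URL, (# of RTs) / (total # of occurrences)."""
    totals: Counter = Counter()
    retweets: Counter = Counter()
    for url, is_retweet in occurrences:
        totals[url] += 1
        if is_retweet:
            retweets[url] += 1
    return {url: retweets[url] / totals[url] for url in totals}
```

Under this definition a URL that spreads only through reintroductions has an RT rate of 0, while one that spreads mostly through acknowledged rebroadcasts approaches 1.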
6. CONCLUSIONS
In this paper, we investigated a classic problem in media communications research, captured by the
first part of Lasswell's maxim, "who says what to whom," in the context of Twitter. In particular,
we find that although audience attention has indeed fragmented among a wider pool of content
producers than classical models of mass media, attention remains highly concentrated, where
roughly 0.05% of the population accounts for almost half of all posted URLs. Within this
population of elite users, moreover, we find that attention is highly homophilous, with
celebrities following celebri-
7. REFERENCES
[1] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying influencers on Twitter. In
Fourth ACM International Conference on Web Search and Data Mining (WSDM), Hong Kong, 2011. ACM.
[2] W. L. Bennett and S. Iyengar. A new era of minimal effects? The changing foundations of
political communication. Journal of Communication, 58(4):707-731, 2008.
[3] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence on Twitter: The
million follower fallacy. In 4th Int'l AAAI Conference on Weblogs and Social Media, Washington,
DC, 2010.
[4] J. S. Coleman, E. Katz, and H. Menzel. The diffusion of an innovation among physicians.
Sociometry, 20(4):253-270, 1957.
[5] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of
a social system. Proceedings of the National Academy of Sciences, 105(41):15649, 2008.
[6] T. Gitlin. Media sociology: The dominant paradigm. Theory and Society, 6(2):205-253, 1978.
[7] M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence.
In