grimtrigger
12/8/2010
Contents
Introduction
Data
    Descriptive Statistics
Models
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
Empirical Results
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
    Discussion of variables across models
Conclusion
    Summary
    Further Research
    Limitations
Introduction
Social media is a radical new way for individuals to reach online content. It gives the user
access to millions of sources, and gives non-traditional sources a level playing field with
traditional sources. Understanding how readers digest these sources is important to content
creators and special interests seeking to promote specific content.
This paper looks into one site in particular, Reddit.com. Reddit users, referred to as “Redditors”,
can give submissions either positive or negative votes. The sum of these votes is referred to as
“Karma” and reflects the popularity of such posts. It is important to note that Karma begins at 1
and cannot drop below 0. The most popular posts make it to the front page, where they receive
the most attention.
Submissions can take many forms, specifically: articles, blogs, self-posts, images, audio/visual
and other. Most of these are self-explanatory but self-posts refer to simple text submissions by
Redditors. They do not link outside of Reddit.com, and are similar to a blog post with
Reddit.com as the blog host. These different types of content are likely to receive different levels
of Karma since they are digested differently.
Reddit also has a subReddit system. SubReddits are communities organized around ideologies,
interests, or lifestyles. Users who identify with certain subReddits have an option to subscribe,
and their personalized front page will then contain submissions to that subReddit. SubReddits
are also biased towards articles that reflect their ideology, so a liberal article posted to
/r/conservative can expect very little Karma.
The size of the subReddit to which an item was submitted is a likely predictor of that
submission’s Karma score for two reasons. First, the number of readers in that subReddit
reflects the level of interest in that subject matter. Second, the number of readers in a
subReddit indicates how much time a new submission spends in view before it is bumped
off by newer ones. In other words, there is a tradeoff between submitting to a large subReddit, with a lot of
interest but too many voices, and a small subReddit, with little interest but where each voice is well
heard. Between these two extremes, there is likely a sweet spot that is most fertile for
Karma.
It is important to note an anomaly in the subReddit system: the main Reddit.com subReddit.
This subReddit has over 33,000 readers but has no discernible special interest. It is a vestigial
part of Reddit from before the subReddit system. I flag submissions to this subReddit in my
data to see whether it affects Karma.
The most important predictor of Karma is immeasurable: engagement. Some posts are more
popular simply because they better catch or hold the user’s attention. I use the number of
comments as an imperfect proxy for engagement. The rationale is that commenting takes
more effort than simply voting. Commenting on a post indicates that the reader has made
enough of a connection to the submission to invest a response.
Data
A sample of 200 submissions was randomly selected from the 100,000,000 submissions made
to Reddit prior to submission number 24274635, which was made on December 5th, 2010.
These 100,000,000 only include submissions made after Reddit’s subreddit feature was brought
out of beta-mode in 2008. Data was obtained by randomly generating 200 numbers within the
range, converting each number to base-36, and appending it to the URL
www.reddit.com/comments/ . Of these 200 observations, 51 were deemed spam and excluded
from the data, leaving 149.
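The sampling step described above can be sketched in Python. This is only an illustration under stated assumptions, not the script used for this paper: the function names, the `seed` argument, the placeholder lower bound of 1, and the choice to sample without replacement are all hypothetical. Only the cutoff 24274635, the sample size of 200, and the URL prefix come from the text.

```python
import random
import string

# Digits 0-9 followed by lowercase a-z: the 36 symbols of base-36.
BASE36_DIGITS = string.digits + string.ascii_lowercase
URL_PREFIX = "www.reddit.com/comments/"


def to_base36(n: int) -> str:
    """Convert a non-negative integer to its base-36 string."""
    if n < 0:
        raise ValueError("submission numbers must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


def sample_submission_urls(lowest_id, highest_id, k=200, seed=None):
    """Draw k distinct submission numbers in [lowest_id, highest_id] and build URLs."""
    rng = random.Random(seed)
    ids = rng.sample(range(lowest_id, highest_id + 1), k)
    return [URL_PREFIX + to_base36(i) for i in ids]


# lowest_id=1 is a placeholder (assumption); highest_id is the cutoff named in the text.
urls = sample_submission_urls(lowest_id=1, highest_id=24274635, k=200, seed=0)
```

Each resulting URL would then be visited by hand to record Karma, comments, and the other variables, and to weed out spam.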
Karma: Popularity ranking
Comments: Number of comments
Readers: Number of Redditors subscribed to that specific subreddit as of December 5th, 2010
Main: A binary variable; 1 if it was submitted to the “Reddit.com” subreddit, 0 if it was not
Type: Article, Blog, Self, AV, Image, Other are mutually exclusive binary variables which tell what type
of submission it was; AV is short for Audio/Visual
Descriptors: Political and NSFW are binary variables that describe the link; NSFW is short for Not Safe For
Work and usually refers to pornographic submissions
Descriptive Statistics
Variable   Mean      Median  Max     Min  Std dev
Karma      11.39     1       397     0    46.12
Comments   6.36      0       205     0    22.26
Readers    232478.2  334022  437813  10   19037.1
Main       .26       0       1       0    .44
Article    .31       0       1       0    .46
Blog       .13       0       1       0    .34
Self       .20       0       1       0    .32
AV         .13       0       1       0    .33
Image      .19       0       1       0    .40
Other      .05       0       1       0    .22
Political  .12       0       1       0    .32
NSFW       .03       0       1       0    .18
Models
Model 1
LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG +
β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW
This is the basic model I will be testing. It takes logarithms of the quantitative
variables since they have exponential distributions. It is regressed through the origin since it can
be assumed that if all the independent variables evaluate to 0, LOG(KARMA+1) should be 0.
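To make the specification concrete, the following sketch builds the response and the eleven regressors of Model 1 for a single submission. The dictionary keys, the function name, and the use of the natural logarithm are assumptions made for illustration; the paper does not state which logarithm base or which software was used.

```python
import math


def model1_row(sub):
    """Return (y, x): the Model 1 response and its 11 regressors for one submission."""
    log_readers = math.log(sub["readers"])  # natural log is an assumption
    y = math.log(sub["karma"] + 1)
    x = [
        math.log(sub["comments"] + 1),  # β1: LOG(COMMENTS+1)
        log_readers,                    # β2: LOG(READERS)
        log_readers ** 2,               # β3: LOG(READERS)^2
        sub["main"],                    # β4
        sub["article"],                 # β5
        sub["blog"],                    # β6
        sub["self"],                    # β7
        sub["image"],                   # β8
        sub["av"],                      # β9
        sub["political"],               # β10
        sub["nsfw"],                    # β11
    ]
    return y, x


# A made-up submission in which every regressor evaluates to 0:
# no comments, a subReddit with a single reader, type Other, no descriptors.
baseline = {
    "karma": 0, "comments": 0, "readers": 1, "main": 0, "article": 0,
    "blog": 0, "self": 0, "image": 0, "av": 0, "political": 0, "nsfw": 0,
}
y0, x0 = model1_row(baseline)
```

Fitting the coefficients by least squares on rows built this way, with no constant column, corresponds to regression through the origin.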
Model 2
LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE
+ β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW
This is similar to Model 1, but only includes COMMENTS as an independent variable for non-SELF
submissions. This model recognizes that self-posts often have more comments, not
because they are more engaging, but because they are asking a question. Karma is ma
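The comment term of Model 2 can be illustrated with a short sketch. The function name and argument names are hypothetical, and the natural logarithm is an assumption. Because SELF is either 0 or 1, ABS(SELF-1) acts as a switch that is 1 for non-self submissions and 0 for self-posts.

```python
import math


def comments_term(comments: int, is_self: int) -> float:
    """ABS(SELF-1)*LOG(COMMENTS+1): comments count only for non-self submissions."""
    return abs(is_self - 1) * math.log(comments + 1)


# Two made-up submissions with the same number of comments.
link_post = comments_term(comments=10, is_self=0)  # equals LOG(11)
self_post = comments_term(comments=10, is_self=1)  # switched off
```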
Model 3
LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV
+ β10 POLITICAL + β11 NSFW
This model does not include COMMENTS. It recognizes that COMMENTS is an imperfect
measure of how engaging a submission is, and it may be better to leave it out altogether.
Model 4
LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE
+ β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C
This is the same as Model 1 but includes an intercept, C. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.
Model 5
LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE
+ β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C
This is the same as Model 2 but includes an intercept, C. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.
Model 6
LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV
+ β10 POLITICAL + β11 NSFW + C
This is the same as Model 3 but includes an intercept, C. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.
Empirical Results
Model 1
Dependent Variable: LOG(KARMA+1)
The R^2 and adjusted R^2 of the model are promising, explaining 63% and 60% of the variance
respectively. But the absolute t-statistics for the non-binary variables, LOG(COMMENTS+1),
LOG(READERS), and LOG(READERS)^2, are large. The absolute t-statistics for MAIN and
ARTICLE are smaller, but both would still be rejected at the 5% level.
Although the explanatory power of the model is still in question, it does seem to match the
theory. LOG(COMMENTS+1) has a positive coefficient. The positive coefficient of LOG(READERS)
and the negative coefficient of LOG(READERS)^2 form a downward-facing parabola, which matches the
expected tradeoff between large and small subReddits.
One thing that does not make sense is the positive coefficient of MAIN. However, this alone does not
justify discarding the model.
LOG(KARMA+1) is maximized when a subReddit has 646,711 readers. This is, worryingly,
above the maximum number of READERS observed (437,813).
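The Karma-maximizing subReddit size follows from the vertex of the parabola. Holding the other variables fixed and writing x = LOG(READERS), the fitted value is β2 x + β3 x^2 plus a constant, which peaks at x = -β2 / (2 β3) when β3 is negative. The sketch below applies that formula. The function name is hypothetical, the natural logarithm is an assumption, and the coefficient values in the example are made up for illustration since the estimated coefficients are not reproduced in this text.

```python
import math


def karma_maximizing_readers(b2: float, b3: float) -> float:
    """Readers count at the vertex of b2*x + b3*x^2, where x = LOG(READERS)."""
    if b3 >= 0:
        # An upward-facing or flat curve has no interior maximum.
        raise ValueError("the coefficient on LOG(READERS)^2 must be negative")
    log_readers_at_peak = -b2 / (2 * b3)
    return math.exp(log_readers_at_peak)


# Hypothetical coefficients, NOT the estimates from this paper.
peak = karma_maximizing_readers(b2=0.5, b3=-0.02)
```

A peak that lands above the largest observed READERS value means the fitted curve is still rising over the whole sample, so the location of the maximum is an extrapolation.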
Model 2
The R^2 and adjusted R^2 took a hit. While the t-statistics of LOG(READERS) and
LOG(READERS)^2 became slightly more promising, they would still be rejected at the 1% level.
This model has less explanatory power than Model 1 but partially solves the high t-statistic of
LOG(COMMENTS+1).
As in Model 1, the positive coefficient of MAIN does not make sense. However, this alone does not
justify discarding the model.
LOG(KARMA+1) is maximized when a subReddit has 1,260,155 readers. Once again, this is
alarmingly high.
Model 3
Dependent Variable: LOG(KARMA+1)
Compared to the previous two models, this one seems much less able to explain the variation in
LOG(KARMA+1). It seems clear that COMMENTS should be included in the model in one way
or another. This is also the first of the models to give MAIN a negative coefficient.
LOG(KARMA+1) is maximized when a subReddit has 9,935,701 readers. Once again, this is
alarmingly high.
Model 4
Dependent Variable: LOG(KARMA+1)
Compared to its intercept-lacking counterpart, Model 1, this model explains the variance in
LOG(KARMA+1) slightly better. On another positive note, all the variables except LOG(COMMENTS+1) and
C would not be rejected at the 5% level.
However, the large coefficient of C runs contrary to theory. The positive coefficient of
MAIN also does not make sense. Even more troubling, both LOG(READERS) and
LOG(READERS)^2 have negative coefficients, meaning a subReddit of 1 reader is most fertile
for Karma. This is obviously untrue, and I feel it is enough to throw this model out.
Model 5
Dependent Variable: LOG(KARMA+1)
Relative to Model 2, which lacks an intercept, this model explains about the same amount of
variance. The t-statistics for all variables other than ABS(MAIN-1)*ABS(SELF-1)*LOG(COMMENTS+1)
are promising. And although including an intercept runs contrary to
theory, the intercept is relatively small.
LOG(KARMA+1) is maximized when a subReddit has 86,250 readers. This is the first such
value to make practical sense.
Model 6
The addition of an intercept to Model 3 barely changes the explanatory power of the model.
However, the absolute t-statistic of each individual variable is impressively small, with only
NSFW being rejected at the 10% level. Also promising is that even though including an intercept
runs contrary to theory, the coefficient is small. The negative coefficient of MAIN is also
promising.
LOG(KARMA+1) is maximized when a subReddit has 7,119,007 readers. Once again, this is
alarmingly high.
Discussion of variables across models
The inclusion of LOG(COMMENTS+1) in some form consistently increased the explanatory
power of the model. However, its removal increased the probability that the other variables were
part of the model. Model 5 seems to exemplify the best tradeoff, where comments relate to
Karma only for non-self posts.
All but Model 4 showed that there is a number of readers at which Karma is maximized.
However, Model 5 was the only one to give a maximum within the
range of readers observed. 86,250 is the magic number of readers, according to Model 5.
MAIN was negative in only 2 models, neither of which is Model 5, the most
promising model. In Model 5, however, MAIN would be rejected at the 5% level. More testing is needed
to see how submitting to the Reddit.com subReddit affects Karma.
ARTICLE and IMAGE have positive coefficients in every model, with comfortable
t-statistics. It seems conclusive that articles and images receive more Karma than their
counterparts.
AV and SELF sometimes have positive and sometimes have negative coefficients. In Model 5,
both AV and SELF have a negative effect. However, more testing is needed to
verify this.
BLOG was the only variable to be negative in every single model. Furthermore, none of the
models would reject the effect of BLOG at the 5% level.
POLITICAL has a consistently positive coefficient, while NSFW has a consistently negative
coefficient.
Conclusion
Summary
My analysis shows that articles, images, and submissions with a political spin lend themselves
to more Karma. Blogs and pornography, however, generate less Karma. Self posts and
audio/visual submissions are somewhere in between.
There is also a sweet spot in the size of a subReddit. 86,250 is a rough estimate of where that
sweet spot may be. SubReddits smaller than this don’t display the numbers needed to maximize
Karma, while subReddits larger than this crowd it out.
There is a lack of evidence to conclude that comments are a good predictor of Karma.
Further Research
The use of comments as a proxy for engagement seemed to be the most notable flaw in my
models. If a better proxy could be found, better models could be created.
Limitations
The sample size was unfortunately small: 149 submissions, not including spam posts. A larger sample
would greatly increase the validity of the model as well as better expose variable interactions.
Another limitation was the blurry line between the submission categories, particularly between
articles and blogs. It is hard to decipher what is a blog when established sources hire bloggers
to create content, and individual bloggers attempt to present themselves as established
sources.
Another difficult question is what would be considered political. Is a news story which is not
overtly editorialized but slightly biased considered political? If the bias wasn’t obvious at a
cursory glance, I wouldn’t mark it as political. However, a deeper reading might reveal it was so.