
Assignment 3: Data Analysis

A Look Into Reddit's Star Dish


Spring 2017

Team
Sindhu Babu, ssb257
Frances Coronel, fc333

Abstract
Since the launch of Reddit in June 2005, the site has become the 7th most visited in
the U.S. (as of April 2017 [1]), and its users have posted billions of comments.
Those comments are filled with abbreviations, internet memes and slang, much like
the rest of the web, and collectively they form a trove of data about how people use
language online [2].
And yet only 4 years ago, the New Yorker reported on "The Psychology of Online
Comments" and how several well-known websites like Reuters, Popular Science and the
Chicago Sun-Times were turning off comments.
The editors behind these changes argued that the comments, particularly
anonymous ones, undermined the integrity of science and led to a culture of
aggression and mockery that hinders substantive discourse.
Our study aims to examine how a website like Reddit has survived by analyzing the
comments from the month of May 2015 and reviewing which emotional and
communication styles garner the most responses.

[1] List of Most Popular Websites, Wikipedia
[2] How the Internet Talks, FiveThirtyEight

In other words, what components make up Reddit's secret dish of comments,
and what allows them as a whole to succeed in a digital world where many
platforms fail to regulate such discussion systems?

Hypotheses
By analyzing comments on Reddit from May 2015, we hypothesize the following.

Sentiment Analysis
What kind of sentiment in comments drives the highest reply rates? Passive,
assertive, aggressive, or sarcastic sentiment?
There will be a positive correlation between response rate and level of aggression.
In other words, the most aggressive comments measured by overall sentiment will
have the highest response rate.
This hypothesis focuses on the role of group identification and group norms in
contributing to polarization.

Meforming versus Informing


What kind of information style drives the highest reply rates? Meforming or
informing?
There will be a positive correlation between response rate and meforming.
In other words, the comments that lean towards meforming will have the highest
response rate.
This hypothesis focuses on the role that self-reference, or meforming, plays in
comments found on Reddit.

Connective Media Theories


We created these hypotheses based on several connective media theories.

Emotional Contagion

Connective media is an emerging field that studies many aspects of social media, in
particular the capacity to spread emotions quickly throughout the online world, or
emotional contagion [3].

Group Polarization
Group polarization, a side effect of emotional contagion, is a common occurrence
on the Internet. Group polarization is the phenomenon in which individuals tend to
endorse a more extreme position in the direction already favored by the group [4].

Meforming versus Informing


Meformers are a set of users newly defined by research done in 2010 on social
awareness streams [5]. They are described as users who typically post
messages relating to themselves or their thoughts, while informers, in contrast,
post messages that are informational in nature.

Dataset
Our dataset was open-sourced by Reddit and is distributed via Kaggle [6].
It contains a portion of the comments made on Reddit during May 2015 (8 GB
compressed, 30 GB uncompressed), which is itself a small portion of the
publicly available comments that Reddit released (over 1 terabyte).
The database provided on Kaggle has one table called May2015 with the following
data fields.
created_utc
ups
subreddit_id
link_id
name
score_hidden
author_flair_css_class
author_flair_text

[3] Measuring Emotional Contagion in Social Media, Ferrara & Yang, 2015
[4] Group Polarization: A Critical Review and Meta-Analysis, Isenberg, 1986
[5] Is It Really About Me? Message Content in Social Awareness Streams, Naaman, Boase & Lai, 2010
[6] May 2015 Reddit Comments, Kaggle

subreddit
id
removal_reason
gilded
downs
archived
author
score
retrieved_on
body
distinguished
edited
controversiality
parent_id

We focused primarily on the data fields score (number of upvotes) and body (the
text of the comment).
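As a rough illustration of working with these two fields, the sketch below builds a tiny in-memory SQLite table with the same table name and a subset of the columns listed above, then summarizes score. The rows are invented for the example, and the helper name score_summary is hypothetical; this is not the code from our kernel.

```python
import sqlite3

# In-memory stand-in for the Kaggle table. Only four of the listed
# columns are created, and every row below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE May2015 (id TEXT, subreddit TEXT, score INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO May2015 VALUES (?, ?, ?, ?)",
    [
        ("c1", "AskReddit", 12, "I think this is great"),
        ("c2", "AskReddit", 3, "Source: the linked article"),
        ("c3", "science", 7, "The study had 40 participants"),
    ],
)


def score_summary(connection):
    """Return (comment_count, average_score) over all comments."""
    row = connection.execute(
        "SELECT COUNT(*), AVG(score) FROM May2015"
    ).fetchone()
    return row[0], row[1]


count, avg_score = score_summary(conn)
print(count, round(avg_score, 2))
```

The same SELECT could in principle be pointed at the real database file instead of the in-memory table, at the cost of scanning a much larger dataset.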

Analysis
We were able to analyze the data without manually downloading a large dataset
or performing data cleanup, since the dataset is open-sourced on
Kaggle.
It was simply a matter of creating a new kernel via Kaggle.
This code was then shared on GitHub (see Source Code section).
We used SQL to analyze the data and generate our data visualizations.

Sentiment Analysis
What kind of sentiment in comments drives the highest reply rates? Passive,
assertive, aggressive, or sarcastic sentiment?


This graph showcases the various sentiments used in comments (aggressive,
assertive, passive, and sarcastic) on the X-axis while the number of comments for
each sentiment is shown on the Y-axis.
The labels represent the ranking of each sentiment in terms of number of upvotes
and are not related to the number of comments.
To differentiate each sentiment, we identified keywords that were uniquely
representative of each sentiment. For example, aggressive comments would have a
lot of curse words as keywords.
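A minimal sketch of this kind of keyword matching is shown below, assuming a SQLite table shaped like May2015. The keyword lists, the sample rows, and the function name sentiment_stats are all hypothetical; the keywords we actually used are not reproduced here.

```python
import sqlite3

# Hypothetical keyword lists, chosen only to illustrate the approach.
SENTIMENT_KEYWORDS = {
    "aggressive": ["damn", "idiot"],
    "assertive": ["i believe", "definitely"],
    "passive": ["maybe", "i guess"],
    "sarcastic": ["/s", "yeah right"],
}


def sentiment_stats(connection):
    """Per sentiment, return (number of matching comments, average score)."""
    stats = {}
    for sentiment, keywords in SENTIMENT_KEYWORDS.items():
        # One LIKE clause per keyword; values are bound as parameters.
        where = " OR ".join("LOWER(body) LIKE ?" for _ in keywords)
        params = [f"%{kw}%" for kw in keywords]
        count, avg = connection.execute(
            f"SELECT COUNT(*), AVG(score) FROM May2015 WHERE {where}", params
        ).fetchone()
        stats[sentiment] = (count, avg)
    return stats


# Made-up rows standing in for real comments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE May2015 (score INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO May2015 VALUES (?, ?)",
    [
        (10, "You are an idiot"),
        (6, "Damn, that is wrong"),
        (4, "I believe this is correct"),
        (2, "Maybe it works"),
        (1, "Great idea /s"),
    ],
)

stats = sentiment_stats(conn)
for sentiment, (count, avg) in stats.items():
    print(sentiment, count, avg)
```

Note that substring matching of this kind lets one comment fall into several sentiments at once, and it cannot tell quoted or negated keywords from sincere ones.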
Aggressive comments were actually the most likely to have the highest number of
upvotes, with a ranking score of 6.45, which is ~11% higher than the second best
(assertive).
In turn, assertive comments had the highest reply rates with nearly 90,000
comments, about twice as many (2x) as the next best (aggressive).
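Comparisons like these are easy to misstate, because "N% higher than" and "N% of" describe different quantities. The short sketch below makes the distinction explicit; the baseline count of 45,000 is a hypothetical value picked only to produce a 2x ratio and is not a figure from our results.

```python
def percent_higher(value, baseline):
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def percent_of(value, baseline):
    """Value expressed as a percentage of baseline."""
    return value / baseline * 100.0


# Hypothetical counts that stand in a 2x ratio to each other.
larger, smaller = 90_000, 45_000

print(percent_higher(larger, smaller))  # 100.0: twice as many is 100% higher
print(percent_of(larger, smaller))      # 200.0: twice as many is 200% of
```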
Surprisingly, comments categorized as sarcastic by the Reddit community were the
least likely to be upvoted or commented on.
It should be noted that we are not counting the replies to every post but instead
trying to capture the overall sentiment of each subreddit.

Based on these results, our hypothesis on the positive correlation between
aggression and reply rates is rejected. However, it is clear that there is in fact
a positive correlation between aggression and the number of upvotes.

Meforming versus Informing


What kind of information style drives the highest reply rates? Meforming or
informing?


This graph showcases the two contrasting information styles that can be used in
comments (meforming versus informing) on the X-axis while the number of
comments for each communication style is shown on the Y-axis.
The labels represent the ranking of each communication style in terms of number
of upvotes and are not related to the number of comments.
To differentiate between the two, we identified keywords that would
denote meforming, while all other comments were identified as informing.
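The sketch below shows one way such a two-way split could work, assuming that first-person pronouns mark a comment as meforming. The pattern, the sample comments, and the function names information_style and summarize are hypothetical; they are not the keywords or code from our analysis.

```python
import re

# Hypothetical markers of self-reference; any comment without one of
# these words falls through to "informing", mirroring the default rule.
MEFORMING_PATTERN = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)


def information_style(body):
    """Label a comment body as 'meforming' or 'informing'."""
    return "meforming" if MEFORMING_PATTERN.search(body) else "informing"


def summarize(comments):
    """comments: iterable of (body, score). Returns {style: (count, mean score)}."""
    totals = {"meforming": [0, 0], "informing": [0, 0]}
    for body, score in comments:
        bucket = totals[information_style(body)]
        bucket[0] += 1
        bucket[1] += score
    return {
        style: (n, total / n if n else None)
        for style, (n, total) in totals.items()
    }


# Made-up comments for illustration only.
sample = [
    ("I love my dog", 5),
    ("The API returns JSON", 3),
    ("Paris is the capital of France", 7),
    ("That happened to me too", 1),
]
result = summarize(sample)
print(result)
```

Because informing is the catch-all category in this scheme, it will tend to absorb far more comments than meforming, which is worth keeping in mind when comparing raw comment counts between the two styles.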
Meforming comments had the highest number of upvotes, with a ranking score of
5.68, which is only ~1% higher than informing.
Informing comments had the highest reply rates, with over 1 million comments, which
is a staggering ~500% higher than meforming.
Meforming fared much worse when it came to reply rates but surprisingly was
slightly higher when it came to number of upvotes.

It should be noted that we are not counting the replies to every post but instead
trying to capture the overall information style of each subreddit.
Based on these results, interestingly enough, our hypothesis on the positive
correlation between meforming and reply rates is rejected. However, it is
clear that there is in fact a positive correlation between meforming and the
number of upvotes.

Source Code
You can also view this code on GitHub.



Conclusion
Based on our data analysis of comments on Reddit during the month of May 2015,
it can be concluded that a user on Reddit is more likely to have a higher reply rate
for a comment that is assertive and informing.
In turn, it can also be concluded that a user on Reddit is less likely to have a higher
reply rate for a comment that is sarcastic and meforming.
