Arjan Durresi
School of Science
Indiana University Purdue University at Indianapolis
Indianapolis, IN, USA
lalfanto@umail.iu.edu
School of Science
Indiana University Purdue University at Indianapolis
Indianapolis, IN, USA
durresi@cs.iupui.edu
I. INTRODUCTION
The research life cycle involves several stages: planning the
proposal, starting the project, collecting the data, analyzing
the data, and obtaining the results. Data collection is an
important stage because the collected data is what answers the
research questions.
II. RELATED WORK
Fig. 1.
C. Yelp
B. Facebook
Facebook is a social networking site that provides services
such as posting status updates, posting images, and making
friends. It is a rich source of big data, and there are
different techniques for collecting that data. For example,
it offers the Graph API Explorer, a tool for generating
data using either the Graph API or FQL (Facebook Query
Language). According to the Facebook Developer page [8], the
Graph API is a low-level HTTP-based API that can be used to
query data, post new stories, upload photos, and perform a
variety of other tasks that an app might need to do.
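Because the Graph API is HTTP-based, a read query is essentially a GET request to a URL that names a node and the fields wanted. The following is a minimal sketch of how such a URL could be assembled; the helper name build_graph_url, the node "me", the field names, and the placeholder token are assumptions made for illustration and are not taken from the original work. No request is sent.

```python
from urllib.parse import urlencode

GRAPH_BASE_URL = "https://graph.facebook.com"  # public Graph API host


def build_graph_url(node, fields, access_token):
    """Return the URL of a GET request that reads `fields` of `node`.

    Hypothetical helper for illustration only: it builds the URL and
    does not contact Facebook.
    """
    if not node or not fields:
        raise ValueError("node and at least one field are required")
    query = urlencode({
        "fields": ",".join(fields),    # comma-separated list of fields
        "access_token": access_token,  # OAuth access token (placeholder here)
    })
    return f"{GRAPH_BASE_URL}/{node}?{query}"


# Example: read the id and name of the user who owns the token.
url = build_graph_url("me", ["id", "name"], "PLACEHOLDER_TOKEN")
print(url)
```

A real application would obtain the access token through OAuth and send the request with an HTTP client.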
D. LinkedIn
LinkedIn is a business-oriented social networking site.
According to [11], LinkedIn is the largest professional social
network, with more than 238M members. It offers
services for advertisers to reach target users based on
their profiles and behaviors.
A. Collect by Group
We collected data from three groups related to the stock market.
The idea is to create an application that has different classes.
One of them has been designed to retrieve all the users
Fig. 2.
Fig. 3.
V. RESULT
Table I shows information about collecting data from three
different groups on Twitter. We can notice that the more users
a group has, the more time is needed for data collection. This
is due to Twitter's rate limits on retrieving data, which are
350 requests per hour according to the Twitter developer site.
To speed up the process, several tokens should be created so
that several requests can run in parallel. Creating several
tokens significantly decreased the time required to collect the
data. This can be seen in Table I by comparing the collection of
around 1,700,000 user accounts in 2 months with the collection
of around 287,720 users in 2 months. Table II compares data
collection from different social networking sites based on
several criteria. The comparison is based on the work related
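The effect of the rate limit and of using several tokens can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that each token has its own independent quota of 350 requests per hour and that the number of requests needed is known in advance; the function names estimated_hours and split_round_robin are hypothetical and do not come from the original application.

```python
import math

REQUESTS_PER_HOUR = 350  # rate limit quoted in the text


def estimated_hours(total_requests, tokens=1,
                    requests_per_hour=REQUESTS_PER_HOUR):
    """Estimate the hours needed to issue `total_requests`.

    Assumption: every token has an independent hourly quota, so the
    combined throughput is tokens * requests_per_hour.
    """
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    if tokens < 1 or requests_per_hour < 1:
        raise ValueError("tokens and requests_per_hour must be positive")
    return math.ceil(total_requests / (tokens * requests_per_hour))


def split_round_robin(user_ids, tokens):
    """Assign user IDs to tokens in round-robin order.

    Each token receives a near-equal share, so the per-token work can
    run in parallel.
    """
    if tokens < 1:
        raise ValueError("tokens must be positive")
    return [user_ids[i::tokens] for i in range(tokens)]


# Illustrative figures only: one request per user is an assumption.
one_token = estimated_hours(287_720, tokens=1)
four_tokens = estimated_hours(287_720, tokens=4)
print(one_token, four_tokens)
```

Under these assumptions, 287,720 requests need about 823 hours with one token and about 206 hours with four, which illustrates why additional tokens shorten the collection time.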
TABLE I
COLLECTING DATA FROM DIFFERENT GROUPS IN TWITTER

Group           Number of Users  Number of Tweets  Approximate time needed for collection  End date of collection
StockTwits      287,720          3,282,373         2 months                                July 2013
FinancialTimes  1,700,000        1,576,889         2 months                                October 2013
MarketWatch     928,066          9,909,333         1 month                                 February 2014

TABLE III
COLLECTING DATA FROM DIFFERENT SOCIAL NETWORKING SITES

Factor          By Group                                          By Filtering
Time needed     month or more (depending on the number of users)  Few minutes
Network Size    Large                                             Small
Data Relevance
TABLE II
COLLECTING DATA FROM DIFFERENT SOCIAL NETWORKING SITES

Criteria                          Twitter       Facebook         Yelp    LinkedIn
Library/SDK                       Twitter4j     Facebook SDK     scribe  scribe
Programming Language              java, python  php, javascript  java    java
Authentication and Authorization  OAuth         OAuth            OAuth   OAuth

TABLE IV
COLLECTING DATA BY FILTERING IN MAY 5TH - 9TH

Keyword  Number of tweets
$YHOO    299
$P       299
$EBAY    399
$BBRY    664
$TWTR    99
$V       99
$YELP    997
$MDR     108
$SPLK    237
$HIMX    158
$LBTYA   245
$FEYE    397
$AMZN    898
$NKE     169
$LNKD    299
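Counting tweets per keyword, as in the keyword tables, can be illustrated with a small local filter over tweet text that has already been collected. This is only a sketch: the function name count_cashtags and the sample tweets are hypothetical, the regular expression is a simplified notion of a cashtag, and the original application's filtering code is not shown in the source.

```python
import re
from collections import Counter

# Simplified cashtag pattern: "$" followed by 1 to 6 letters, not
# preceded by a word character or another "$". This is an assumption,
# not Twitter's exact rule.
CASHTAG = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,6})\b")


def count_cashtags(tweets, keywords):
    """Count how many tweets mention each keyword (case-insensitive).

    A tweet that repeats the same cashtag is counted once for it.
    """
    wanted = {k.upper().lstrip("$") for k in keywords}
    counts = Counter({"$" + k: 0 for k in wanted})
    for text in tweets:
        found = {m.upper() for m in CASHTAG.findall(text)}
        for symbol in found & wanted:
            counts["$" + symbol] += 1
    return dict(counts)


# Made-up sample tweets, for illustration only.
sample = [
    "Buying $YHOO and $AMZN today",
    "$yhoo earnings beat",
    "Paid $5 for lunch",
    "$YELP $YELP up",
]
result = count_cashtags(sample, ["$YHOO", "$AMZN", "$YELP", "$P"])
print(result)
```

Matching whole symbols rather than prefixes matters for short keywords such as $P and $V, which would otherwise also match longer ticker symbols.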
TABLE V
COLLECTING DATA BY FILTERING IN MAY 9TH - 13TH

Keyword  Number of tweets
$YHOO    901
$P       99
$EBAY    335
$BBRY    670
$TWTR    1198
$V       716
$YELP    199
$MDR     63
$SPLK    80
$HIMX    283
$LBTYA   137
$FEYE    499
$AMZN    1411
$NKE     99
$LNKD    905

ACKNOWLEDGMENT
The authors would like to thank Twitter developers for
providing documentation that was helpful when using the
Twitter API for implementing the program needed for data
collection.
REFERENCES
[1] M. A. Russell, Mining the Social Web. O'Reilly, 2nd ed., October 2013.
[2] J. Y. Park, K. Daejeon, and C.-W. Chung, "When daily deal services
meet twitter: Understanding twitter as a daily deal marketing platform,"
Proceedings of the 3rd Annual ACM Web Science Conference, 2012.
[3] Y.-C. Wang, M. Burke, and R. Kraut, "Security in vehicular ad-hoc
network with identity-based cryptography approach: A survey," CHI
2013: Changing Perspectives, Paris, France, 2013.
[4] Y. Mao, W. Wei, and B. Wang, "Twitter volume spikes: Analysis and
application in stock trading," Proceedings of the 7th Workshop on Social
Network Mining and Analysis, 2013.
[5] G. Comarela, V. Almeida, M. Crovella, and F. Benevenuto, "Understanding
factors that affect response rates in twitter," Proceedings of the 23rd
ACM conference on Hypertext and social media, pp. 123-132, 2012.
[6] T. Developer, "Rate limiting," https://dev.twitter.com/docs/ratelimiting/1, 2013.
[7] B. Gokulakrishnan, P. Priyanthan, T. Ragavan, N. Prasath, and A. Perera,
"Opinion mining and sentiment analysis twitter data stream," Advances
in ICT for Emerging Regions (ICTer), 2012 International Conference
on, pp. 182-188, 2012.
[8] F. Developer, "Using the graph api,"
https://developers.facebook.com/docs/graph-api/using-graph-api/, 2014.
[9] B. Rieder, "Studying facebook via data extraction: The netvizz
application," Proceedings of the 5th Annual ACM Web Science Conference,
2013.
[10] B. Brown, "Beyond recommendations: Local review web sites and
their impact," Transactions on Computer-Human Interaction (TOCHI),
vol. 19, no. 4, 2012.
[11] D. Agarwal, "Computational advertising: The linkedin way,"
Proceedings of the 22nd ACM International Conference on Conference on
Information and Knowledge Management, vol. 19, no. 4, 2013.