
2014 International Conference on Network-Based Information Systems

Techniques for Collecting Data in Social Networks


Lina Alfantoukh
School of Science
Indiana University Purdue University at Indianapolis
Indianapolis, IN, USA
lalfanto@umail.iu.edu

Arjan Durresi
School of Science
Indiana University Purdue University at Indianapolis
Indianapolis, IN, USA
durresi@cs.iupui.edu

Abstract—Data is the raw information needed to form useful knowledge, and obtaining data is a need shared by individuals such as researchers. From this perspective, data collection is an important step in any research or experiment. Data collection can be defined as the process of gathering and processing information in order to evaluate outcomes and use them in research. Online Social Networking (OSN) sites are one of the best sources of data. In this paper, we introduce the benefits of using social networking sites for data collection and the different techniques that can be used. We also describe the techniques used for data collection in a project on using a trust management system in Twitter. The goal was to collect data from Twitter, including user information and tweets; based on these data, a network of trust is created using the relationships among users. The collection techniques differ in terms of efficiency and usefulness of the collected data.

Index Terms—Social Networking; Twitter; Data Collection; Stock Market; Facebook; Big Data

I. INTRODUCTION

The research life cycle involves several stages, such as planning the proposal, starting the project, collecting the data, analyzing the data and obtaining the results. Data collection is an important stage, since it provides the material that answers the research questions.

Data can be collected with different techniques depending on the source of the data. Examples of data sources include interviews with people, distributed surveys, and social networking sites.

Social networking sites provide a powerful source for data collection, since such sites store demographic information about people, their relationships with other users on the same site, and their ranking, review or rating information, depending on the nature of the site.

Several points should be taken into consideration while collecting data, such as accuracy, reliability and privacy:

- Accuracy: The accuracy of the collected data is essential, because inaccurate data will lead to incorrect results or observations. Accurate data helps maintain the integrity of the research.
- Reliability: The reliability of the data must be taken into consideration by examining whether the source is reliable or not.
- Privacy: The privacy of the data must be respected, since some data are supposed to be kept private. When collecting data, the collector should check whether he or she is allowed to obtain it. For example, some Twitter accounts are private, and accessing them without authorization would cause a privacy violation.

In this paper, we mainly focus on collecting data from Twitter among other social networks. [1] stated that Twitter "is an important phenomenon from the standpoint of its incredibly high number of users, as well as its use as a marketing device and emerging use as a transport layer for third party messaging services. It offers an extensive collection of APIs, and although you can use a lot of the APIs without registering, it's much more interesting to build up and mine your own network."

The main contributions of this work include:

- Collect data from OSNs, specifically from Twitter.
- Demonstrate different approaches for collecting data from Twitter.
- Evaluate the two approaches and examine the possibility of combining them.

The remainder of this paper is organized as follows. Section II introduces related work on data collection from OSNs. Section III describes different methods for data collection from social networking sites such as Twitter, Facebook, LinkedIn and Yelp. Section IV presents the collection techniques used for a research project on using a trust management system in Twitter, and Section V reports the evaluation of those techniques and the results. Finally, we conclude in Section VI.

II. RELATED WORK

A large number of studies have used online social networks for data collection. For example, one work studies Twitter as a daily deal marketing platform. [2] indicated that "For our analysis, we collected a sample of tweets that contain the URL link groupon.com/deals/. We built a Twitter crawler using the streaming Application Programming Interface (API) provided by Twitter, and collected the tweet stream in real-time from September 3rd 2011 to January 31st 2012." Using those data, the authors showed that daily deal sharing on Twitter occurs most often in the morning around the middle of the week, and that deals offered in multiple locations tend to be shared frequently.

[3] collected data from Facebook in order to use Latent Dirichlet Allocation (LDA) to identify topics from more than half a million Facebook status updates and determine which topics are more likely to receive feedback, such as likes and comments. They stated that "This study demonstrates gender differences in the topics of status updates on SNS. Adult women are more likely to write about personal topics, while men are more likely to write about philosophical topics. For teens, the gender differences are muted." They randomly sampled one million English status updates posted by U.S. Facebook users in June 2012. For each status update, they analyzed post time, number of viewers, and number of comments and likes within three days of posting. Additionally, they collected demographic information about the poster, including gender, age, friend count, and days since the poster registered.

There is also work investigating Twitter volume spikes related to S&P 500 stocks and whether they are useful for stock trading. The authors state: "In this paper, we investigate Twitter volume spikes related to S&P 500 stocks, and whether they are useful for stock trading. Through correlation analysis, we provide insight on when Twitter volume spikes occur and possible causes of these spikes." [4]. Their data were collected from Twitter using stock symbols prefixed by a dollar sign.

[5] addressed the problem of understanding how Twitter users interact with their timelines, and showed that users prefer to interact with newer tweets and with users they have interacted with before. Collecting data from Twitter was therefore needed. They state: "Our dataset contains 54,981,152 user accounts connected to each other by 1,963,263,821 social links. Our dataset also contains all tweets ever posted by the collected users, which consists of 1,755,925,520 tweets."

III. COLLECTION METHODS

There are different social networking sites from which we can collect data, such as Twitter, Facebook, Yelp and LinkedIn. In general, the data can be obtained in different ways, such as contacting the site administrator to get a dataset or downloading a dataset created for academic purposes or for a challenge. However, designing a system or writing a program that collects the data directly is sometimes more useful. Figure 1 summarizes these methods of collecting data.

Fig. 1. Methods of Data Collection

A research work has been proposed to show that a trust management system can predict indirect trust. Data collection was needed in order to run the experiments and the evaluation, and one of the best data sources for this research was Twitter. The idea was to search for groups on Twitter that talk about the stock market, and to collect user IDs so that the users' information could later be retrieved from Twitter. This work is used throughout this paper as an example of how the data are collected.

As stated before, each social site offers an API (Application Programming Interface) that helps the data collector request services from the site. The common procedure is to install the corresponding library, obtain the authorization, and then decide on the platform the collector will use to write the code.

A. Twitter

Twitter is a microblogging service that allows users to post messages of at most 140 characters. Twitter4j is a Java library for the Twitter API. When creating an application that uses the Twitter API, the developer obtains OAuth credentials consisting of a consumer key, a consumer secret, an access token and an access token secret; these are used to authorize the developer when collecting data from Twitter. The developer must have a Twitter account in order to use the library. Once the developer signs in to the Twitter account, he or she can create an application and obtain the authorization, and then use a platform such as NetBeans to write the Java code that connects to the Twitter services through the library. After creating the Java application, the developer supplies the authorization components and imports the twitter4j library in order to call the functions that retrieve the data. Section IV shows how these steps were applied in our data collection.
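
As a minimal sketch of the authorization and retrieval steps just described: the sketch uses Twitter4j, as in the project, but the OAuth values are placeholders, the queried account is only an example, and rate-limit and error handling are omitted.

    import twitter4j.Status;
    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;
    import twitter4j.conf.ConfigurationBuilder;

    public class TwitterAuthExample {
        public static void main(String[] args) throws Exception {
            // OAuth values obtained when creating the application (placeholders, not real keys).
            ConfigurationBuilder cb = new ConfigurationBuilder();
            cb.setOAuthConsumerKey("CONSUMER_KEY")
              .setOAuthConsumerSecret("CONSUMER_SECRET")
              .setOAuthAccessToken("ACCESS_TOKEN")
              .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");

            Twitter twitter = new TwitterFactory(cb.build()).getInstance();

            // Retrieve the most recent tweets of an example account.
            for (Status status : twitter.getUserTimeline("StockTwits")) {
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
        }
    }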

However, the Twitter API limits the amount of data that can be retrieved per unit of time for each set of OAuth information. As stated in the Twitter developer documentation, "OAuth calls are permitted 350 requests per hour and are measured against the oauth token used in the request" [6]. Consequently, it was important to create a number of applications that are identical except for their OAuth information. Running all the created applications simultaneously proved effective, and as a result all the required data could be obtained in a reasonable amount of time.
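
A rough sketch of this idea is shown below: it builds one Twitter4j client per registered application and hands them out in round-robin fashion. The class is ours for illustration only, and the credential values are placeholders.

    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;
    import twitter4j.conf.ConfigurationBuilder;
    import java.util.ArrayList;
    import java.util.List;

    public class TokenPool {
        private final List<Twitter> clients = new ArrayList<Twitter>();
        private int next = 0;

        // Each row holds {consumerKey, consumerSecret, accessToken, accessTokenSecret}
        // for one registered application (placeholder values).
        TokenPool(String[][] credentials) {
            for (String[] c : credentials) {
                ConfigurationBuilder cb = new ConfigurationBuilder();
                cb.setOAuthConsumerKey(c[0]).setOAuthConsumerSecret(c[1])
                  .setOAuthAccessToken(c[2]).setOAuthAccessTokenSecret(c[3]);
                clients.add(new TwitterFactory(cb.build()).getInstance());
            }
        }

        // Hands out clients round-robin so requests are spread across tokens.
        synchronized Twitter nextClient() {
            Twitter t = clients.get(next);
            next = (next + 1) % clients.size();
            return t;
        }
    }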

An important point to mention is that when collecting the data, we consider only public accounts, since private accounts cannot be retrieved or seen publicly. It has been stated that "The users' privacy issues were handled as follows. Most tweets on twitter are set public, and can be viewed by any person regardless of membership to twitter. The tweets which require following the author in order to be accessible are termed private, and they are not reflected on the public data stream (called the public timeline)" [7].

B. Facebook

Facebook is a social networking site that provides services such as posting statuses, posting images and making friends. It is a great source of big data, and there are different techniques to collect data from it. For example, Facebook offers the Graph API Explorer, a convenient tool for generating data using either the Graph API or FQL (Facebook Query Language). According to the Facebook developer page [8], "It's a low-level HTTP-based API that you can use to query data, post new stories, upload photos and a variety of other tasks that an app might need to do."
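
Because the Graph API is HTTP-based, a node can be queried with a plain GET request once an access token is available, for example from the Graph API Explorer. The sketch below is illustrative only: the access token is a placeholder and the requested fields are assumptions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class GraphApiExample {
        public static void main(String[] args) throws Exception {
            // Placeholder token; a real one comes from a Facebook app or the Graph API Explorer.
            String accessToken = "ACCESS_TOKEN";
            URL url = new URL("https://graph.facebook.com/me?fields=id,name"
                    + "&access_token=" + accessToken);

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // Read and print the JSON response.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }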

The other technique for collecting Facebook data is a tool called Netvizz, developed by a researcher to help other researchers collect data for research purposes. "The Netvizz application provides raw data for both personal networks and pages, but provides data perspectives not available in other tools, e.g. comment text extraction; it also provides data for groups, a third functional space on Facebook." [9] In order to use this tool, it is mandatory to have a Facebook account. The page that provides the interface for the tool lists several criteria from which we select and retrieve the data accordingly. The steps needed to extract some data are:

1. Log in to the Facebook account.
2. Search for a relevant group. For example, a researcher interested in stock market data can search for a group such as the "Stock talk" page.
3. Join or like the group page.
4. Open the tool, select "stock talks" from the groups list, and enter the maximum number of latest posts.
5. Select posts by pages and users associated with the group.
6. Three files will be generated as the result, each with different fields.

C. Yelp

Yelp is a social networking site that helps people find local businesses, write reviews about them, and rate both the businesses and other people's reviews; it also helps in finding events. [10] mentioned that "Yelp has grown to dominate the online review of establishments and the number of individual reviews of businesses in the US and Europe is over twenty-five million". The Yelp API allows developers to access search results that match the results on Yelp. Authentication and authorization are mandatory, so the developer needs to register for an account, after which the authentication components can be requested in order to use the services provided by the API.

In addition to the API, data can be collected from the dataset offered by Yelp for academic purposes. The user requests access using an academic email address, and once access is granted the data can be downloaded from Yelp's site. The dataset is divided into several entities, such as business, user, review, check-in and tip, and each entity has its own attributes. The dataset is provided in JSON format, and the data files can be opened with any text editor. It is the user's responsibility to extract data from those files, so the user may need to write a program that extracts the data based on specific criteria.
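
As an illustration of such an extraction program, the sketch below reads one of the academic dataset files, assuming one JSON object per line (the usual distribution format), and keeps businesses matching a simple criterion. The file name, the field names and the use of the org.json library are assumptions.

    import org.json.JSONObject;
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class YelpDatasetReader {
        public static void main(String[] args) throws Exception {
            // Placeholder file name; each line of the dataset file is one JSON object.
            try (BufferedReader in = new BufferedReader(
                    new FileReader("yelp_academic_dataset_business.json"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    JSONObject business = new JSONObject(line);
                    // Example criterion: keep businesses with at least 100 reviews.
                    if (business.optInt("review_count", 0) >= 100) {
                        System.out.println(business.optString("name") + "\t"
                                + business.optDouble("stars", 0.0));
                    }
                }
            }
        }
    }
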
D. LinkedIn

LinkedIn is a social networking site oriented toward business. [11] stated that LinkedIn is the largest professional social network, with more than 238M members. It offers services for advertisers to reach target users by looking at their profiles and behaviors.

There is a LinkedIn API that is useful for data collection. The user has to register for an account and then create an application. After the registration stage, the developer receives authentication information, including an API key and a secret key. To make API calls, the user then needs to obtain an access token.
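
A sketch of this flow using the scribe library (listed in Table II) is shown below. The keys, the verifier code and the profile endpoint are placeholders for illustration, and the exact calls may differ between scribe versions.

    import org.scribe.builder.ServiceBuilder;
    import org.scribe.builder.api.LinkedInApi;
    import org.scribe.model.OAuthRequest;
    import org.scribe.model.Response;
    import org.scribe.model.Token;
    import org.scribe.model.Verb;
    import org.scribe.model.Verifier;
    import org.scribe.oauth.OAuthService;

    public class LinkedInExample {
        public static void main(String[] args) {
            // API key and secret key received after registering an application (placeholders).
            OAuthService service = new ServiceBuilder()
                    .provider(LinkedInApi.class)
                    .apiKey("API_KEY")
                    .apiSecret("SECRET_KEY")
                    .build();

            // OAuth 1.0a flow: request token -> user authorizes -> verifier -> access token.
            Token requestToken = service.getRequestToken();
            System.out.println("Authorize at: " + service.getAuthorizationUrl(requestToken));
            Verifier verifier = new Verifier("VERIFIER_CODE");   // pasted back by the user
            Token accessToken = service.getAccessToken(requestToken, verifier);

            // Signed call to a profile endpoint of the REST API.
            OAuthRequest request = new OAuthRequest(Verb.GET, "https://api.linkedin.com/v1/people/~");
            service.signRequest(accessToken, request);
            Response response = request.send();
            System.out.println(response.getBody());
        }
    }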

IV. DATA COLLECTION FOR TRUST MANAGEMENT SYSTEM IN TWITTER

In this study, we need to collect data from Twitter related to the stock market. Two algorithms are used. The first starts from a stock market group and then obtains its followers and their tweets; for simplicity we call it Collect by Group. The other algorithm starts by filtering tweets based on specific keywords and then obtains the users associated with those tweets; we call this method Collect by Filtering.

A. Collect by Group

We collected data from three groups related to the stock market. The idea is to create an application with different classes: one class retrieves the IDs of all the users following the groups, and another class obtains data based on the acquired IDs. The collected data consist of each user's screen name, location, tweets, and the date and time each tweet was posted. After retrieval, the data are stored in files for later processing and analysis (Figure 2). The groups from which we collected data are summarized in Table I.

Fig. 2. Collecting Data From Stock Market by Group

The extracted data were saved in separate files. Before running the program, the user IDs were divided into blocks of 180 IDs; when extracting the tweets, each block of 180 users and their corresponding data is saved in one file, so the number of files is the total number of users divided by 180. After the extraction was completed, a program was written to merge all the files into one file for later analysis.
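
The sketch below outlines the Collect by Group procedure with Twitter4j: it gathers the follower IDs of a group account and then writes each block of 180 users' tweets to its own file. The group name and file naming are illustrative, and rate-limit handling, protected accounts and error handling are omitted.

    import twitter4j.IDs;
    import twitter4j.Status;
    import twitter4j.Twitter;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    public class CollectByGroup {

        // Collects follower IDs of a group account, then each follower's recent tweets.
        static void collect(Twitter twitter, String groupScreenName) throws Exception {
            // Step 1: retrieve the IDs of all users following the group account.
            List<Long> followerIds = new ArrayList<Long>();
            long cursor = -1;
            IDs ids;
            do {
                ids = twitter.getFollowersIDs(groupScreenName, cursor);
                for (long id : ids.getIDs()) {
                    followerIds.add(id);
                }
                cursor = ids.getNextCursor();
            } while (ids.hasNext());

            // Step 2: fetch tweets in blocks of 180 users, one output file per block.
            PrintWriter out = null;
            for (int i = 0; i < followerIds.size(); i++) {
                if (i % 180 == 0) {
                    if (out != null) out.close();
                    out = new PrintWriter("block-" + (i / 180) + ".txt");
                }
                long userId = followerIds.get(i);
                for (Status s : twitter.getUserTimeline(userId)) {
                    out.println(s.getUser().getScreenName() + "\t" + s.getUser().getLocation()
                            + "\t" + s.getCreatedAt() + "\t" + s.getText().replace('\n', ' '));
                }
            }
            if (out != null) out.close();
        }
    }
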
B. Collect by Filtering

This approach requires several steps, including finding the right keywords to use for filtering and then using them when requesting data from Twitter (Figure 3). The choice of keywords depends on the research interest; in our case, it is the stock market. Twitter supports a special notation for stock market symbols, namely the ticker symbol preceded by a dollar sign.

In order to find the target tweets, it is essential to build a list of keywords that would help in filtering. Some keywords were found on StockTwits, a website designed specifically for the stock market and Twitter; it shows what is happening with the stocks people own and which companies everyone is talking about each day. The keywords that were chosen are $YHOO, $P, $EBAY, $BBRY, $TWTR, $V, $YELP, $MDR, $SPLK, $HIMX, $LBTYA, $FEYE, $AMZN, $NKE and $LNKD. The number of tweets collected from May 5th until May 9th is 5,367, from 1,811 users. The number of tweets collected from May 9th until May 14th is 7,595, from 2,858 users. Tables IV and V summarize the results.

Fig. 3. Collecting Data From Stock Market by Filtering
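
A simplified sketch of Collect by Filtering with the Twitter4j search API is shown below; the keyword list is supplied by the caller, pagination follows result.nextQuery(), and rate-limit handling is omitted.

    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Status;
    import twitter4j.Twitter;

    public class CollectByFiltering {

        // Searches recent tweets for each stock-symbol keyword and prints user and tweet data.
        static void collect(Twitter twitter, String[] keywords) throws Exception {
            for (String keyword : keywords) {       // e.g. "$YHOO", "$AMZN"
                Query query = new Query(keyword);
                query.setCount(100);                // tweets per page of results
                QueryResult result;
                do {
                    result = twitter.search(query);
                    for (Status s : result.getTweets()) {
                        System.out.println(s.getUser().getScreenName() + "\t"
                                + s.getCreatedAt() + "\t" + s.getText().replace('\n', ' '));
                    }
                    query = result.nextQuery();     // null when there are no more pages
                } while (query != null);
            }
        }
    }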

V. RESULT

Table I shows the information about collecting data from three different groups on Twitter. We can notice that the more users there are, the more time is needed for data collection. The reason is the Twitter rate limit on retrieving data, which is 350 requests per hour according to the Twitter developer site. In order to speed up the process, several tokens should be created so that several requests can run in parallel; creating several tokens decreased the duration required to collect the data significantly, as can be seen in the table by comparing the collection of around 1,700,000 user accounts in two months with the collection of around 287,720 users in two months. Table II shows a comparison of data collection from different social networking sites based on some criteria; the comparison is based on the work related to data collection for the trust management system in Twitter. For example, in terms of programming language, there may be options other than Java and Python for Twitter. Table III shows how collecting data by group differs from collecting data by filtering.

Moreover, to find the number of interactions among users, it is important to extract the tweets that contain mentions of other users. This can be done by finding text in a tweet starting with "@". Extracting the tweets that contain "@" followed by a specific mention gives the following results, based on the Collect by Filtering technique:


- From May 5th to 9th: the number of interactions is 1,795 tweets.
- From May 9th to 13th: the number of interactions is 3,096 tweets.
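
The "@" extraction described above can be sketched as a simple pattern match over the tweet text; alternatively, Twitter4j exposes mention entities directly on each Status via getUserMentionEntities(). The regular expression below is a simplification of Twitter's actual username rules.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MentionExtractor {

        private static final Pattern MENTION = Pattern.compile("@(\\w+)");

        // Returns the screen names mentioned in the text of a tweet.
        static List<String> mentions(String tweetText) {
            List<String> result = new ArrayList<String>();
            Matcher m = MENTION.matcher(tweetText);
            while (m.find()) {
                result.add(m.group(1));
            }
            return result;
        }
    }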

TABLE I
COLLECTING DATA FROM DIFFERENT GROUPS IN TWITTER

Group          | Number of Users | Number of Tweets | Approximate Time Needed | End Date of Collection
StockTwits     | 287,720         | 3,282,373        | 2 months                | July 2013
FinancialTimes | 1,700,000       | 1,576,889        | 2 months                | October 2013
MarketWatch    | 928,066         | 9,909,333        | 1 month                 | February 2014

TABLE III
COMPARISON BETWEEN COLLECTING DATA BY GROUP AND BY FILTERING

Factor         | By Group                                  | By Filtering
Time needed    | A month or more (depending on user count) | A few minutes
Network size   | Large                                     | Small
Data relevance | Some data are not relevant                | Most of the data are relevant

TABLE II
COLLECTING DATA FROM DIFFERENT SOCIAL NETWORKING SITES

Criteria                         | Twitter      | Facebook        | LinkedIn | Yelp
Library/SDK                      | Twitter4j    | Facebook SDK    | scribe   | scribe
Programming Language             | Java, Python | PHP, JavaScript | Java     | Java
Authentication and Authorization | OAuth        | OAuth           | OAuth    | OAuth

The reason for extracting tweets with mentions is to find the pairs of users and then evaluate the relationship between each pair.

VI. CONCLUSION

Data collection is an important stage of any research project, since the data are the main component used for investigation and evaluation. Social networking sites are a very powerful source of data, and using the corresponding APIs facilitates the collection process. In this project we demonstrated different techniques for collecting data from Twitter, targeting stock market related data. One technique starts with a group and obtains the users from it; the other starts with keywords, finds the corresponding tweets and, in turn, obtains the users to build a network of users. Each technique has its own advantages. For future work, we will consider combining both techniques and evaluating the effectiveness of taking advantage of each one. This should be helpful because the network size will be larger than the one built by Collect by Filtering, while at the same time the tweets will be more relevant than those collected by the Group technique.

TABLE IV
COLLECTING DATA BY FILTERING IN MAY 5TH-9TH

Keyword | Number of Tweets
$YHOO   | 299
$P      | 299
$EBAY   | 399
$BBRY   | 664
$TWTR   | 99
$V      | 99
$YELP   | 997
$MDR    | 108
$SPLK   | 237
$HIMX   | 158
$LBTYA  | 245
$FEYE   | 397
$AMZN   | 898
$NKE    | 169
$LNKD   | 299

TABLE V
COLLECTING DATA BY FILTERING IN MAY 9TH-13TH

Keyword | Number of Tweets
$YHOO   | 901
$P      | 99
$EBAY   | 335
$BBRY   | 670
$TWTR   | 1198
$V      | 716
$YELP   | 199
$MDR    | 63
$SPLK   | 80
$HIMX   | 283
$LBTYA  | 137
$FEYE   | 499
$AMZN   | 1411
$NKE    | 99
$LNKD   | 905

ACKNOWLEDGMENT

The authors would like to thank the Twitter developers for providing documentation that was helpful when using the Twitter API to implement the program needed for data collection.

REFERENCES

[1] M. A. Russell, Mining the Social Web. O'Reilly, 2nd ed., October 2013.
[2] J. Y. Park, K. Daejeon, and C.-W. Chung, "When daily deal services meet twitter: Understanding twitter as a daily deal marketing platform," Proceedings of the 3rd Annual ACM Web Science Conference, 2012.
[3] Y.-C. Wang, M. Burke, and R. Kraut, "Gender, topic, and audience response: An analysis of user-generated content on Facebook," CHI 2013: Changing Perspectives, Paris, France, 2013.
[4] Y. Mao, W. Wei, and B. Wang, "Twitter volume spikes: Analysis and application in stock trading," Proceedings of the 7th Workshop on Social Network Mining and Analysis, 2013.
[5] G. Comarela, V. Almeida, M. Crovella, and F. Benevenuto, "Understanding factors that affect response rates in twitter," Proceedings of the 23rd ACM Conference on Hypertext and Social Media, pp. 123-132, 2012.
[6] Twitter Developer, "Rate limiting," https://dev.twitter.com/docs/ratelimiting/1, 2013.
[7] B. Gokulakrishnan, P. Priyanthan, T. Ragavan, N. Prasath, and A. Perera, "Opinion mining and sentiment analysis on a twitter data stream," Advances in ICT for Emerging Regions (ICTer), 2012 International Conference on, pp. 182-188, 2012.
[8] Facebook Developer, "Using the graph api," https://developers.facebook.com/docs/graph-api/using-graph-api/, 2014.
[9] B. Rieder, "Studying facebook via data extraction: The netvizz application," Proceedings of the 5th Annual ACM Web Science Conference, 2013.
[10] B. Brown, "Beyond recommendations: Local review web sites and their impact," ACM Transactions on Computer-Human Interaction (TOCHI), vol. 19, no. 4, 2012.
[11] D. Agarwal, "Computational advertising: The linkedin way," Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, 2013.

