
2010 IEEE International Conference on Services Computing

X-RIME: Cloud-Based Large Scale Social Network Analysis

Wei Xue, JuWei Shi, Bo Yang

IBM Research China
{weixue, jwshi, boyang}@cn.ibm.com

978-0-7695-4126-6/10 $26.00 © 2010 IEEE
DOI 10.1109/SCC.2010.41

Abstract

As an important technique in modern sociology, social network analysis has gained a lot of attention from many disciplines, and has been used as an important complement to traditional statistics and data analysis. In order to make it affordable for analysts with massive and fast growing networks, we present X-RIME, a cloud-based library for large scale social network analysis. We propose an implementation-oriented classification of social network analysis structures and structure variables to guide their implementation over the MapReduce parallel programming model. The layered architecture of the library is described along with the design considerations of each layer. By sharing the same infrastructure and integrating with existing cloud-based data warehouse and data mining libraries and tools, more comprehensive and cost effective social network aware business intelligence solutions could be built. We present several case studies on an online community to illustrate the usage of the X-RIME library. The performance of the library is evaluated with experiments, which demonstrate good scalability.

1. Introduction

As an important technique and research area in modern sociology, social network analysis (SNA) focuses on relationships among social entities and the patterns and implications of these relationships [1]. Such a perspective naturally complements traditional statistics and data analysis, and appeals to researchers and analysts from many other disciplines, such as anthropology, economics, biology, organizational management, fraud detection, security, etc. Even though the meaning of entities and relationships highly depends on the usage scenario, the methodology and analytic concepts involved are the same.

In 1990 or so, interest in social networks and use of the wide-ranging collection of social network methodology began to grow rapidly [2]. Many interesting structures (regular patterns in relationships) and structure variables (quantities that measure structures) [1] were defined and studied. In parallel, libraries and tools were developed to help social network analysts, and many successful applications were built with them. Nowadays, SNA users like Internet-based social network sites and telecom service providers usually possess unprecedentedly huge and fast growing user communities. For example, Facebook had more than 400 million active users by March 2010, and this number is increasing quickly. Any viable SNA solution for such users should be able to handle massive and fast growing data sets at affordable cost, which in turn requires good scalability and reliability. However, such challenges overwhelm most existing SNA libraries and tools. Furthermore, considering the large volume of raw data sets today, there is a great need to integrate SNA with analysis methods like data warehouse and data mining on the same infrastructure, so as to build more comprehensive business intelligence solutions with better cost performance.

Google-like cloud infrastructures have gained a lot of attention in recent years. The combination of a reliable distributed file system [3] and the scalable parallel programming model of MapReduce [4] provides an ideal platform for many large volume data processing workloads. New data warehouse and data mining tools such as JAQL [5] and Mahout [6] have been built on it. We argue that the platform is also suitable for social network analysis over massive and fast growing data sets. In this paper, we propose to build cloud-based services for large scale social network analysis with the same infrastructure and technologies from the cloud computing society. Our main contributions are: 1) We propose an implementation-oriented classification of SNA structures and structure variables, which guides their implementation over the MapReduce parallel programming model; 2) We then introduce X-RIME, an open source library built on Apache Hadoop [7] for social network analysis on massive social data sets. The architecture of the library is described along with the design considerations of each layer; 3) We illustrate the usage and preliminary performance results of X-RIME with real scenarios, which demonstrate the scalability of the library.

The rest of this paper is organized as follows: Section 2 presents related works. In Section 3, the classification and the layered architecture of X-RIME are described. Section 4 describes case studies on a real world online user community. Section 5 presents preliminary performance results of X-RIME. Section 6 concludes this paper and discusses future works.

2. Related Works

Social network analysis is a research topic with a long history and a tremendous amount of literature. Since the focus of our work is only on cloud-based implementation of SNA algorithms, the definition and meaning of SNA structures / structure variables, and the corresponding fundamental (sequential and centralized) algorithms are beyond the scope of this paper. Only implementations (libraries and tools) and advancements in parallel and distributed algorithms are considered and discussed.

There are quite a few SNA libraries and tools in use today. A fairly complete list of them can be found on the corresponding Wikipedia page [8]. The list indicates that most of those libraries and tools run on a single server or even in single threaded mode, and cannot fully unleash the power of contemporary hardware (e.g., multi-core / multi-processor machines, high speed networks). As a result, such libraries and tools usually lack the scalability required by today’s SNA users, especially when dealing with large networks and SNA structures / structure variables with high complexity.

With the advancement of hardware technologies, more and more researchers have started to design parallel and distributed algorithms for graph and network analysis. To name a few of them, [9] proposes to leverage massive multithreading to do search-based graph queries. In [10], fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis are discussed. In [11], an experimental study of parallel bi-connected component (BCC) algorithms is presented. Distributed algorithms for finding strongly connected components, all maximal cliques and minimum-weight spanning trees are discussed in [12], [13] and [14] respectively. The Parallel Boost Graph Library [15] offers distributed graphs and graph algorithms that exploit coarse-grained parallelism (among worker nodes), along with parallel algorithms that exploit fine-grained parallelism (within each worker node). Although these works provide hints for cloud-based implementation, many of them depend heavily on a shared-memory architecture, in which the graph is loaded into main memory and accessed by multiple processors. Such machines are usually not cost efficient, especially when compared with clusters built with commodity computers. Distributed algorithms are generally more scalable and cost efficient. But few existing works provide mechanisms for features like fault tolerance, which are important for real world large scale SNA, especially when running on commodity hardware and networks.

MapReduce has recently gained a lot of attention as a parallel programming model for large scale data analysis. Its unique attractiveness comes from two facets. One is its simple programming model, which significantly simplifies development without compromising expressiveness too much. The other is the ability to achieve automated, scalable, economical, efficient and reliable parallelism on commodity machines, which is appealing to all real world data processing system builders. There are a few examples of MapReduce-based graph algorithms presented in [16] and [17], which inspired our work. But we go further than that. By providing an implementation-oriented classification and a layered architecture, we lay a foundation and provide a guide for cloud-based SNA implementation. Based on that, an open source library is developed, which could be integrated with statistics and data analysis / mining tools like JAQL [5] and Mahout [6] to build more comprehensive business intelligence solutions. This amplifies the value of the components being integrated, which is an extra benefit not provided by existing distributed SNA libraries or tools.

3. Design Overview

3.1 Classification

During the long history of social network analysis, researchers and analysts with different backgrounds have used SNA methodology to study a broad range of usage scenarios (from social statistics to anti-fraud, from finance to the telecom industry), identified different structures and structure variables, and developed models and algorithms accordingly. However, as far as we know, no taxonomy of structures / structure variables and algorithms has been proposed in this field. Textbooks like [1] and [2] usually choose to organize structures / structure variables and algorithms into abstract themes, which are built on mathematical foundations like graph theory, statistical / probability theory, and algebraic models. On the other hand, libraries and tools like Pajek [18] either choose to organize their functionalities in similar ways, or just list them as routines.

In this paper, we propose an implementation-oriented classification of SNA structures / structure variables specific to the MapReduce programming model. The narrowed scope enables us to provide more specific paradigms to guide the implementation, while still supporting many useful structures / structure variables. The core idea of the classification is to categorize SNA structures / structure variables according to their data and computing locality when implemented with MapReduce.

The first category consists of structure variables defined on an individual vertex or edge. Examples include partitioning networks by in-degree, out-degree, and the sum and / or average of them. Such variables have good data and computing locality, and could be calculated by counting edges attached to vertexes, or vertexes attached to edges. Despite their simplicity, they are important and frequently used.
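As an illustration of the first category, the following minimal sketch counts in-degree, out-degree and their sum with one map step and one reduce step. It is an in-memory Python simulation rather than Hadoop code, and the function names (`map_arc`, `reduce_degrees`, `run_degree_job`) are hypothetical; they are not part of the X-RIME API.

```python
from collections import defaultdict


def map_arc(arc):
    """Map step: emit one (vertex, (out, in)) contribution per arc endpoint."""
    source, target = arc
    yield source, (1, 0)  # the arc leaves `source`
    yield target, (0, 1)  # the arc enters `target`


def reduce_degrees(vertex, contributions):
    """Reduce step: sum the contributions collected for one vertex."""
    out_degree = sum(c[0] for c in contributions)
    in_degree = sum(c[1] for c in contributions)
    return vertex, {"out": out_degree, "in": in_degree, "sum": out_degree + in_degree}


def run_degree_job(arcs):
    """Simulate the shuffle: group mapper output by vertex, then reduce each group."""
    groups = defaultdict(list)
    for arc in arcs:
        for vertex, contribution in map_arc(arc):
            groups[vertex].append(contribution)
    return dict(reduce_degrees(v, cs) for v, cs in groups.items())


# Hypothetical toy network: arcs are (source, target) pairs.
degrees = run_degree_job([("B", "A"), ("C", "A"), ("A", "C")])
```

Each reducer only needs the contributions for its own vertex, which is the data and computing locality the classification refers to.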

The second category consists of structures / structure variables defined on sub-graphs, mostly and most importantly, defined on the egocentric network of each vertex. Examples of this kind are egocentric network density, clustering coefficient and maximal cliques. Typically, two phases are involved in their implementation. The first phase is to find or construct sub-graphs. This phase usually involves exploring the neighborhood surrounding each vertex by propagating messages along edges / arcs among adjacent vertexes. The result could be kept and shared among structures / structure variables of this kind. The second phase is to calculate structures / structure variables within each sub-graph. As a result, each sub-graph could be processed in parallel. This paradigm makes the implementation ready for scaling out on networks with evenly distributed sub-graph sizes.

The third category consists of structures / structure variables defined on the whole graph. Most of the interesting SNA structures / structure variables belong to this category, for example, k-core, weakly / strongly / bi-connected components, PageRank [19], hyperlink induced topic search [20], minimum spanning tree, breadth first search, and so on. To calculate them, the whole graph needs to be explored in some way. The implementation usually involves multiple rounds of iterations. Although the operations done in each round are highly algorithm dependent, most of them could be regarded as generating labels on the fly and propagating them through edges / arcs among vertexes. The output of one round is the input of the next round, and some algorithm dependent termination condition needs to be checked at the end or beginning of each round. Since the iterations may take dozens of rounds, network traffic and job scheduling latency would be more significant than in the second category. We are currently investigating techniques besides compression to mitigate such effects.

Structures / structure variables of the last category are those defined on vertex and edge sets. Compared with the third category, they depend not only on links existing in the graph, but also on links that do not exist in the graph. For example, force-directed network layout algorithms and community discovery algorithms usually need to calculate the repulsive force between each pair of vertexes. Generally speaking, the MapReduce programming model is not suitable for this category, since the “any to any” links usually mean a large volume of intermediate results, and the graph structure could not be leveraged to mitigate such explosion. Nevertheless, approximate algorithms exist for some structures / structure variables of this kind, which usually have specific techniques to reduce the calculation space. The grid variant of the Fruchterman-Reingold network layout algorithm [21] is an example, which could leverage the fact that repulsive force is inversely proportional to the square of distance to reduce the number of pairs considered.

3.2 X-RIME Architecture

Figure 1 illustrates the layers that constitute X-RIME, which are enclosed in the dashed box. Like many other Hadoop based data analytics solutions, the bottom layer of X-RIME is Hadoop HDFS. The raw input data, the internal representation of networks after transformation and cleansing, the intermediate results generated in each step of the algorithms, and the final output of X-RIME are all stored in HDFS as files or directories.

Figure 1 Architecture of X-RIME

Above the HDFS layer are X-RIME’s data models. Since social networks are graphs consisting of nodes (actors) and edges / arcs (relationships), two kinds of models are usually used, namely adjacency list and adjacency matrix. Generally speaking, an adjacency list is more suitable for sparse graphs, where the number of edges is much smaller than the square of the number of nodes. On the other hand, an adjacency matrix is more suitable for dense graphs. Since most social networks today are sparse graphs, an adjacency list could result in a much more compact representation. Moreover, many real world social networks are inherently or most naturally encoded as adjacent node pairs. For example, short message logs in the telecom domain encode a pair of sender and receiver in a record. So, the distributed adjacency list is chosen as the basic data model in X-RIME.

Beyond distributed adjacency lists, we designed a few more data models to accommodate different algorithms and application requirements. For example, quite a few algorithms do not care about the order of arcs attached to each vertex, so we replace lists with sets of incoming and outgoing arcs. To accommodate weighted graphs and algorithms which need to explore networks, we attached labels to vertexes and arcs. Labels are important constructs in X-RIME and are used by many SNA algorithms.
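To make the role of labels and of sets of incoming and outgoing arcs more concrete, the sketch below builds adjacency sets from node pair records and then computes weakly connected components by propagating the smallest vertex label round by round, in the spirit of the third category. It is an in-memory Python simulation under stated assumptions: the function names and the min-label scheme are illustrative choices and are not taken from the X-RIME source.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Turn (sender, receiver) records into per-vertex sets of out/in arcs."""
    graph = defaultdict(lambda: {"out": set(), "in": set()})
    for source, target in pairs:
        graph[source]["out"].add(target)
        graph[target]["in"].add(source)
    return dict(graph)


def wcc_round(graph, labels):
    """One round: each vertex sends its label to all neighbours (map),
    then each vertex keeps the smallest label it has seen (reduce)."""
    inbox = defaultdict(list)
    for vertex, arcs in graph.items():
        for neighbour in arcs["out"] | arcs["in"]:  # arc direction is ignored for WCC
            inbox[neighbour].append(labels[vertex])
    return {v: min([labels[v]] + inbox[v]) for v in graph}


def weakly_connected_components(pairs):
    """Iterate rounds until no label changes; the final label identifies the component."""
    graph = build_adjacency(pairs)
    labels = {v: v for v in graph}  # initial label: the vertex id itself
    while True:
        new_labels = wcc_round(graph, labels)
        if new_labels == labels:  # termination condition checked at the end of a round
            return labels
        labels = new_labels


# Hypothetical node pair records, e.g. (sender, receiver) of short messages.
components = weakly_connected_components([("a", "b"), ("c", "b"), ("d", "e")])
```

In a real MapReduce setting each call to `wcc_round` would correspond to one job, so the number of rounds directly determines the job scheduling overhead mentioned for the third category.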

Routines are provided to do the transformation among data models.

As a preprocessing step, raw input data is cleaned and transformed into X-RIME data models before the real analysis begins. This step is specific to the usage scenario and should be implemented by X-RIME users.

Above the data model layer are the Hadoop MapReduce layer and a layer of SNA algorithm implementations. Each step of an SNA algorithm in X-RIME is programmed as a map()/reduce() pair and executed as a MapReduce job. Algorithm specific control flow and data flow between steps are programmed as Hadoop MapReduce clients, which create and submit jobs to the MapReduce runtime. Although the actual flows involved could be sequential, parallel or hybrid, as shown in Figure 1, all such Hadoop MapReduce clients are encapsulated in the same interface, which takes the following arguments:
- An input HDFS file / directory which stores the input network after cleansing and transformation,
- An output HDFS file / directory used to store the final output of the algorithm,
- MapReduce specific parameters, such as the number of mappers / reducers, etc.
- Algorithm specific parameters.
When the MapReduce client for an SNA algorithm is invoked, the caller is blocked until all steps are finished or some step fails. Currently, X-RIME provides SNA algorithms as a library. We are going to expose Web Service interfaces in the near future.

At the top of Figure 1 are social network aware business intelligence applications. They invoke the X-RIME library when they need to calculate any SNA structure or structure variable. For real world business intelligence applications that need the functionality of data warehouse and data mining as well, X-RIME could be integrated with Hadoop-based data warehouse and data mining solutions. At lower layers, they could share the same HDFS and MapReduce infrastructure. At the library or Web Service layer, they could invoke each other on the fly. In this way, more comprehensive and cost effective BI solutions could be built.

4. Case Studies

X-RIME is the output of an ongoing research project, and has been published as an open source project at xrime.sourceforge.net. In this section, we use an online community as the example to present several usage scenarios of X-RIME. Preliminary performance results will be introduced in Section 5.

The online community is in the form of a bulletin board system (BBS) consisting of a number of boards. Each board has its focused themes and could be regarded as a sub-community. The sizes of sub-communities vary from board to board. Registered users could read, post or reply to articles in any board, which means those sub-communities might overlap with each other. Both the online community and several sub-communities are analyzed in the following part of this section.

In this paper, we regard article authors (identified by registered user ids) as SNA actors, and regard the reply relationship among authors as the SNA relationship. Specifically, if an author A creates a new discussion topic by posting an article, and author B replies to articles in this topic, an SNA relationship is created between A and B, and represented as an arc from node B to node A in the social network. For two authors who reply to the same topic, there is no SNA relationship created between them.

Since a user may publish any number of articles in any number of boards, the relationship between any two authors might be created in the contexts of multiple topics and / or multiple boards. For simplicity, MapReduce programs are developed to remove redundancy in the raw data set and generate networks that have no loops and no more than one edge between any two different vertices.

As the disk space of the BBS is limited, administrators of the community and sub-communities choose to delete old articles periodically. As a result, we could not get a complete historical view of those communities. Instead, we use web crawlers to create a snapshot of the whole BBS, which contains articles from the past 3 months or so. There are about 200K nodes and about 1.6M arcs in the social network corresponding to this snapshot.

4.1 Degrees

Although statistics on in and out degrees are the simplest functionality provided by X-RIME, they are still helpful for understanding basic properties of a community. Figure 2 illustrates the distribution of in degree, out degree and the sum of them in 3 selected boards named Circuit, MilitaryView and Career_POST. Among them, the Circuit board is for circuit technology related topics; the MilitaryView board is for military related topics; and the Career_POST board is for job announcements.

As shown in the figure, in all three boards the major part of the population consists of inactive actors, who seldom post or reply to articles and have small degrees. This phenomenon is quite common in all kinds of real world communities, where only a few active actors lead and drive social activities. We can also see many isolated actors (with 0 “in + out” degree), who posted articles and did not get any attention from others. Such actors join and leave the community occasionally, and constitute the most unstable part of the community.
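The construction rule used in this section (an arc from each replier B to the topic author A, with loops and repeated arcs removed) and the notion of isolated actors can be sketched as follows. This is a minimal single-machine Python illustration, not the MapReduce programs used in the study. The input shape (a list of `(topic_author, reply_authors)` tuples) and the function names are assumptions made for the example, and "no more than one edge" is interpreted here as "no repeated arc in the same direction".

```python
def build_reply_network(topics):
    """topics: iterable of (topic_author, reply_authors) tuples (hypothetical input shape).

    Returns (actors, arcs), where arcs is a set of (replier, topic_author)
    pairs without loops or repeated arcs.
    """
    actors = set()
    arcs = set()  # a set drops repeated arcs automatically
    for topic_author, reply_authors in topics:
        actors.add(topic_author)
        for replier in reply_authors:
            actors.add(replier)
            if replier != topic_author:  # a self-reply would form a loop, so skip it
                arcs.add((replier, topic_author))
    return actors, arcs


def isolated_actors(actors, arcs):
    """Actors with 0 "in + out" degree: they appear on no arc at all."""
    connected = {vertex for arc in arcs for vertex in arc}
    return actors - connected


# Hypothetical toy data: author D opens a topic that nobody replies to.
actors, arcs = build_reply_network([("A", ["B", "B", "A", "C"]), ("A", ["B"]), ("D", [])])
lonely = isolated_actors(actors, arcs)
```

Note that two repliers to the same topic (B and C above) are not linked to each other, which matches the rule stated in the text.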

Figure 2 Degree distribution in 3 boards: (a) Circuit board, (b) MilitaryView board, (c) Career_POST board

By comparing the three boards, we can see that the Career_POST community is the loosest one. The dominant and largest degrees of actors in it are much smaller than in the other two communities. This is natural since people come to this board only for information, and no discussion is involved. On the contrary, the Circuit and MilitaryView boards are more like real social groups, where people have topics of common interest for interactions to take place. Of these two boards, the MilitaryView community is the more tightly coupled one, with fewer isolated actors and more interactions among actors.

4.2 Connected Components

Among all SNA structures, connected components are basic methods to partition a social network. Currently, X-RIME supports weakly connected components (WCC), strongly connected components (SCC) and bi-connected components (BCC). Figure 3 illustrates the population distributions among WCCs in the 3 boards, and makes the difference among them clearer.

Figure 3 Population distributions among WCCs

As shown in the figure, the largest WCC in a tightly connected community like the Circuit or MilitaryView board holds a significant part of its population. This indicates that most actors in the community are connected with each other directly or indirectly. A tighter community usually has a smaller number of WCCs, and the ratio of the largest WCC to the whole community is larger. On the contrary, the Career_POST board consists of many small WCCs, which indicates its looseness as a social network.

4.3 Cores and Cliques

K-cores and maximal cliques are useful SNA structures when studying the topology and construction of a social network. Particularly, K-cores could be used to simplify networks for analysis or visualization purposes. The k value indicates the nesting level of the core within the network. Maximal cliques could be used to find core members and structures in a community, among many other potential usages. X-RIME supports both of them.
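As a reminder of what the k value of a k-core measures, the following sketch computes, for each vertex, the largest k such that the vertex belongs to the k-core, using the standard sequential peeling procedure. It is not the MapReduce implementation in X-RIME. It assumes arcs are treated as undirected edges and loops are ignored, and the function name `core_numbers` is hypothetical.

```python
def core_numbers(edges):
    """Return {vertex: largest k such that the vertex belongs to the k-core}."""
    remaining = {}
    for u, v in edges:
        if u == v:
            continue  # ignore loops
        remaining.setdefault(u, set()).add(v)
        remaining.setdefault(v, set()).add(u)

    core = {}
    k = 0
    while remaining:
        # Vertices whose remaining degree is at most k cannot be in the (k+1)-core.
        peel = [v for v, neighbours in remaining.items() if len(neighbours) <= k]
        if not peel:
            k += 1  # everything left has degree > k, so it survives into the next core
            continue
        for v in peel:
            core[v] = k
            for n in remaining.pop(v):
                if n in remaining:
                    remaining[n].discard(v)
    return core


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d attached to a.
cores = core_numbers([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")])
```

In the toy graph, the triangle forms the 2-core while the pendant vertex only reaches the 1-core, which illustrates the nesting of cores by k value.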

Figure 4 Population distributions among k-cores

Figure 4 illustrates the population distribution among k-cores of different k values in the 3 boards. The upper bound of the k value in a more tightly connected community is usually larger than that in a less tightly connected community.

Figure 5 Numbers of maximal strong cliques of different sizes

Figure 5 illustrates the number of maximal strong cliques of different sizes in the whole BBS community. This number decreases quickly as the size of the maximal strong clique increases, which reflects the quickly increasing cost of building such tightly coupled structures.

Figure 6 visualizes k-cores and maximal strong cliques in the Circuit community. Vertexes in k-cores are arranged in concentric circles. The distance between a vertex and the centre of the circles negatively correlates with the highest k value of the k-cores to which the vertex belongs. Vertexes in maximal strong cliques are drawn in green, while the others are drawn in red. The bi-directional edges within maximal strong cliques are drawn in red, while the others are drawn in black. As we can see in this figure, most maximal strong cliques are included in the k-cores with the few largest k values. Specifically, the k-core with the largest k value covers almost all maximal cliques with more than 2 vertexes. This phenomenon is common for most boards in this community, and could be used to reduce the search space for maximal strong cliques. Exceptions to this phenomenon are boards like Career_POST, where only a few small maximal strong cliques exist.

Figure 6 K-cores and maximal strong cliques in Circuit board

4.4 PageRank

PageRank is a link analysis algorithm originally used to assign a numerical weight to each element of a hyperlinked set of documents. The numerical weight assigned to any given element E is also called the PageRank of E. Its usage could naturally be extended to a social network, where the numerical weight is assigned to each actor and measures the relative importance of the actor in the network. Besides PageRank, X-RIME also supports Hyperlink-Induced Topic Search (HITS), which is another algorithm of this kind.
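The way PageRank assigns a weight to each actor can be sketched as an iteration in which every vertex splits its current rank over its outgoing arcs and every vertex then sums what it receives. The following Python sketch is an in-memory illustration only. The damping factor of 0.85, the fixed number of rounds and the uniform redistribution of rank from vertexes without outgoing arcs are common conventions assumed here, not details taken from the X-RIME implementation.

```python
def pagerank(arcs, damping=0.85, rounds=50):
    """arcs: iterable of (source, target) pairs; returns {vertex: rank}, ranks sum to 1."""
    arcs = list(arcs)
    nodes = sorted({vertex for arc in arcs for vertex in arc})
    out_links = {v: [] for v in nodes}
    for source, target in arcs:
        out_links[source].append(target)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(rounds):
        received = {v: 0.0 for v in nodes}
        dangling = 0.0
        for v in nodes:  # "map": split the current rank over outgoing arcs
            if out_links[v]:
                share = rank[v] / len(out_links[v])
                for target in out_links[v]:
                    received[target] += share
            else:
                dangling += rank[v]  # no outgoing arcs: spread this rank evenly
        # "reduce": combine the received shares with the damping term
        rank = {
            v: (1.0 - damping) / n + damping * (received[v] + dangling / n)
            for v in nodes
        }
    return rank


# Hypothetical reply network: B and C reply to A, and A replies to C.
ranks = pagerank([("B", "A"), ("C", "A"), ("A", "C")])
```

In the reply network of this paper an arc points from the replier to the topic author, so an author whose topics attract replies from well-ranked actors accumulates a high rank, regardless of how many articles that author posts.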
Figure 7 Article Number vs. PageRank in MilitaryView board: (a) Article numbers, (b) PageRank

Figure 7 illustrates the numbers of articles published by the 20 most active authors in the MilitaryView board, along with their PageRank values. The authors are sorted according to their article numbers. The difference between 7(a) and 7(b) indicates the difference between “active authors” and “opinion leaders”, of which the latter are more important in usage scenarios like marketing and customer churn prevention.

5. Experimental Results

Figure 8 X-RIME Scales Out

We evaluated the performance of X-RIME on the social network of the online BBS community mentioned in the last section. Seven compute nodes connected via 1Gb Ethernet are used to construct Hadoop clusters of different sizes. Each node contains 4GB main memory with two 1.66GHz Intel Xeon processors, but for our tests, we only enable 1 processor / core on each node for simplicity. Among those nodes, a dedicated node is used as the master node (running NameNode and JobTracker), and the other nodes are used as slaves (running DataNode and TaskTracker). The default configuration is used for those Hadoop clusters.

Figure 9 CPU and Network Usage of X-RIME: (a) CPU usage of the maximal strong clique algorithm, (b) network usage of the maximal strong clique algorithm, (c) CPU usage of the weakly connected component algorithm, (d) network usage of the weakly connected component algorithm

Figure 8 illustrates performance results of the maximal strong clique and weakly connected component algorithms implemented in X-RIME, which are representatives of the second and third categories of our classification respectively. The horizontal axis indicates the number of slave nodes in the Hadoop cluster, while the vertical axis indicates the wall-clock time used by the processing. We note that the scalability of both algorithms is quite good here, while the weakly connected component algorithm scales slightly better than the maximal strong clique algorithm. This is because the social network is sparse and the connectivity of vertexes is imbalanced across the network. This in turn causes a workload imbalance among vertexes, which has a larger impact on the maximal strong clique algorithm than on the weakly connected component algorithm. We are currently investigating heuristics for input splitting in order to balance the workload among slave nodes.

Figure 9 illustrates the cluster CPU and network usage extracted with the Ganglia Monitoring System [21] when executing these two algorithms. We note that the maximal strong clique algorithm and the weakly connected component algorithm are both compute intensive, and the network bandwidth is not a bottleneck. This means the implementation of both algorithms in X-RIME could scale out to much larger social networks and clusters without upgrading the existing 1Gb network infrastructure.

6. Conclusion and Future Works

In this paper, we have introduced X-RIME, a cloud-based library for large scale social network analysis. An implementation-oriented classification and a layered architecture are proposed to guide the development of SNA algorithms. By sharing the same infrastructure and integrating with existing cloud-based data warehouse and data mining libraries and tools, more comprehensive and cost effective social network aware business intelligence solutions could be built. We have also presented several case studies on an online community to illustrate the usage of X-RIME. The performance of X-RIME is also evaluated with experiments, which show good scalability on large social networks and clusters.

X-RIME is an ongoing research project and an open source project at the same time. Future works include extending the functionality with more SNA structures and structure variables, and enhancing the Hadoop MapReduce framework to better support the implementation of SNA algorithms.

7. References

[1] Wasserman, Stanley, and Faust, Katherine. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press, 1994.
[2] Carrington, Peter J., John Scott, and Stanley Wasserman (Eds.). Models and Methods in Social Network Analysis. New York: Cambridge University Press, 2005.
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP, pages 29-43, 2003.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004.
[5] JAQL - Query Language for JavaScript Object Notation (JSON). http://code.google.com/p/jaql/.
[6] Apache Mahout. http://lucene.apache.org/mahout/.
[7] Apache Hadoop. http://hadoop.apache.org/core.
[8] Social network analysis software. http://en.wikipedia.org/wiki/Social_network_analysis_software.
[9] Jonathan W. Berry, Bruce Hendrickson, Simon Kahan, and Petr Konecny. Software and Algorithms for Graph Queries on Multithreaded Architectures. In IEEE IPDPS, pages 1-14, 2007.
[10] David A. Bader and Kamesh Madduri. Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks. In ICPP, pages 539-550, 2006.
[11] Guojing Cong and David A. Bader. An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs). In IEEE IPDPS, page 45b, 2005.
[12] Will McLendon III, Bruce Hendrickson, Steven J. Plimpton, and Lawrence Rauchwerger. Finding strongly connected components in distributed graphs. Journal of Parallel and Distributed Computing, 65(8): 901-910, 2005.
[13] Fábio Protti, Felipe M. G. França, and Jayme Luiz Szwarcfiter. On Computing All Maximal Cliques Distributedly. In Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, pages 37-48, 1997.
[14] Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1): 66-77, 1983.
[15] Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computations. In Parallel Object-Oriented Scientific Computing (POOSC), 2005.
[16] Cluster Computing and MapReduce. http://code.google.com/intl/en/edu/submissions/mapreduc
[17] Jimmy Lin. Graph Algorithms with MapReduce. http://www.umiacs.umd.edu/~jimmylin/cloud-2008-
[18] Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. New York: Cambridge University Press, 2004.
[19] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, 1999.
[20] Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5): 604-632, 1999.
[21] Ganglia Monitoring System. http://ganglia.info/.