
Pydoop: a Python MapReduce and HDFS API for Hadoop

Simone Leo
CRS4
Pula, CA, Italy
simone.leo@crs4.it

Gianluigi Zanetti
CRS4
Pula, CA, Italy
gianluigi.zanetti@crs4.it

ABSTRACT

MapReduce has become increasingly popular as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop, written in Java. There is considerable interest, however, in tools that enable Python programmers to access the framework, due to the language's high popularity. Here we present a Python package that provides an API for both the MapReduce and the distributed file system parts of Hadoop, and show its advantages with respect to the other available solutions for Hadoop Python programming, Jython and Hadoop Streaming.

Categories and Subject Descriptors

D.3.3 [Programming Languages]: Language Constructs and Features - Modules, packages

1. INTRODUCTION

In the past few years, MapReduce [16] has become increasingly popular, both commercially and academically, as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop [5], which also includes a distributed file system, HDFS, inspired by the Google File System [17]. Hadoop, a top-level Apache project, scales up to thousands of computing nodes and is able to store and process data on the order of petabytes. It is widely used across a large number of organizations [2], most notably Yahoo, which is also the largest contributor [7]. It is also offered as a feature by cloud computing environments such as Amazon Web Services [1].
Hadoop is fully written in Java, and provides Java APIs to interact with both MapReduce and HDFS. However, programmers are not limited to Java for application writing.


The Hadoop Streaming library (included in the Hadoop distribution), for instance, allows the user to provide the map and reduce logic through executable scripts, according to a fixed text protocol. In principle, this makes it possible to write an application in any programming language, but it imposes restrictions on both the set of available features and the format of the data to be processed. Hadoop also includes a C library to interact with HDFS (interfaced with the Java code through JNI) and a C++ MapReduce API that takes advantage of the Hadoop Pipes package. Pipes runs the application-specific C++ code in a separate process, exchanging serialized objects with the framework via a socket. As a consequence, a C++ application is able to process any type of input data and has access to a large subset of the MapReduce components. Hadoop's standard distribution also includes Python examples that are, however, meant to be run on Jython [13], a Java implementation of the language which has several limitations compared to the official C implementation (CPython).
In this work, we focused on building Python APIs for both the MapReduce and the HDFS parts of Hadoop. Python is an extremely popular programming language characterized by very high level data structures and good performance. Its huge standard library, complemented by countless third-party packages, allows it to handle practically any application domain. Being able to write MapReduce applications in Python therefore constitutes a major advantage for any organization that focuses on rapid application development, especially when there is a considerable amount of internally developed libraries that could be reused.
The standard and probably most common way to develop Python Hadoop programs is to either take advantage of Hadoop Streaming or use Jython. In Hadoop Streaming, an executable Python script can act as the mapper or reducer, interacting with the framework through standard input and standard output. The communication protocol is a simple text-based one, with newline characters as record delimiters and tabs as key/value separators (a minimal sketch of such scripts is shown below). Therefore, Streaming cannot process arbitrary data streams, and the user directly controls only the map and reduce parts (i.e., one cannot write a RecordReader or Partitioner). Jython is a Java implementation of the Python language which allows a Python programmer to import and use Java packages. This, of course, gives access to the same features available to Java. However, it comes at a cost to the Python programmer: Jython is typically, at any given time, one or more releases behind CPython; it does not implement the full range of standard library modules; and it does not support modules written as C/C++ extensions. Existing Python libraries can therefore be used for Hadoop programming with Jython only if they meet the above restrictions. Moreover, the majority of publicly available third-party packages are not compatible with Jython, most notably the numerical computation libraries (e.g., NumPy [10]) that constitute an indispensable complement to Python for scientific programming.
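To make the Streaming protocol described above concrete, the sketch below shows what a minimal Streaming-style Python word count could look like. It is not taken from this paper, and the script names are illustrative: the mapper and reducer exchange plain-text records over standard input and output, with tabs separating keys from values, and the reducer relies on the framework having sorted its input by key.

#!/usr/bin/env python
# mapper.py -- illustrative Streaming mapper: emits one tab-separated
# key/value record per output line.
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t%d" % (word, 1)

#!/usr/bin/env python
# reducer.py -- illustrative Streaming reducer: input arrives sorted by
# key, so counts can be accumulated over each contiguous key group.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print "%s\t%d" % (current, count)
        current, count = word, 0
    count += int(value)
if current is not None:
    print "%s\t%d" % (current, count)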
Our main goal was to provide access to as many as possible of the Hadoop features available to Java, while allowing compact CPython code development with little impact on performance. Since the Pipes/C++ API seemed to meet these requirements well, we developed a Python package, Pydoop, by wrapping its C++ code with Boost.Python [15]. We also wrapped the C libhdfs code to make HDFS operations available to Python. To evaluate the package in terms of performance, we ran a series of tests comparing Pydoop with the other two solutions for writing Python applications in Hadoop, Jython and Hadoop Streaming (with Python scripts), and also with Java and C++, to give a general idea of how much performance loss is to be expected when choosing Python as the programming language for Hadoop. We purposely ran a very simple application in order to focus on the interaction with the Java core framework: any application with a nontrivial amount of computation performed inside the mapper and/or reducer would likely run faster if written in C++, independently of how it interfaces with the framework. Pydoop is currently being used in production for bioinformatics applications and other purposes [18]. It currently supports Python 2.5 and Python 2.6.
The rest of the paper is organized as follows. After discussing related work, we introduce Pydoop's architecture and features; we then present a performance-wise comparison of our implementation with the other ones; finally, we give our conclusions and plans for future work.

2. RELATED WORK

Efforts aimed at easing Python programming on Hadoop include Happy [6] and Dumbo [4]. These frameworks, rather than providing a Java-like Python API for Hadoop, focus on building high-level wrappers that hide all job creation and submission details from the user. As far as MapReduce programming is concerned, they are built upon, respectively, Jython and Hadoop Streaming, and thus they suffer from the same limitations.
Existing non-Java HDFS APIs use Thrift [14] to make HDFS calls available to other languages [8] (including Python) by instantiating a Thrift server that acts as a gateway to HDFS. In contrast, Pydoop's HDFS module, being built as a wrapper around the C libhdfs code, is specific to Python but does not require a server to communicate with HDFS.
Finally, there are several non-Hadoop MapReduce implementations available for writing applications in languages other than Java. These include Starfish [12] (Ruby), Octopy [11] (Python) and Disco [3] (Python, with the framework written in Erlang). Since Hadoop is still the most popular and widely deployed implementation, however, we expect our Python bindings to be useful for the majority of MapReduce programmers.


Figure 1: Hadoop Pipes data flows. Hadoop communicates with user-supplied executables by means of a specialized protocol. Almost all components can be delegated to the user-supplied executables, but, at a minimum, it is necessary to provide the Mapper and Reducer classes.

3. ARCHITECTURE

3.1 MapReduce Jobs in Hadoop

MapReduce [16] is a distributed computing paradigm tailored for the analysis of very large datasets. It operates on an input key/value pair stream that is converted into a set of intermediate key/value pairs by a user-defined map function; the user also provides a reduce function that merges together all intermediate values associated with a given key to yield the final result. In Hadoop, MapReduce is implemented according to a master-slave model: the master, which performs task dispatching and overall job monitoring, is called the Job Tracker; the slaves, which perform the actual work, are called Task Trackers.
A client launching a job first creates a JobConf object whose role is to set required (Mapper and Reducer classes, input and output paths) and optional job parameters (e.g., the number of mappers and/or reducers). The JobConf is passed on to the JobClient, which divides input data into InputSplits and sends job data to the Job Tracker. Task Trackers periodically contact the Job Tracker for work and launch tasks in separate Java processes. Feedback is provided in two ways: by incrementing a counter associated with some application-related variable or by sending a status report (an arbitrary text message). The RecordReader class is responsible for reading InputSplits and providing a record-oriented view to the Mapper. Users can optionally write their own RecordReader (the default one yields a stream of text lines). Other key components are the Partitioner, which assigns key/value pairs to reducers, and the RecordWriter, which writes output data to files. Again, these can optionally be written by the user (the defaults are, respectively, to partition based on a hash value of the key and to write tab-separated output key/value pairs, one per line).
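To complement this description, the following single-process toy sketch models the map/shuffle/reduce flow in plain Python; it is purely illustrative, not part of Hadoop or Pydoop, and all names in it are made up.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # map phase: every input key/value pair yields intermediate pairs
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    # the grouping above stands in for the shuffle; reduce then merges
    # all values associated with each intermediate key
    return dict((k, reduce_fn(k, vs)) for k, vs in intermediate.iteritems())

def wc_map(offset, line):
    # word count map function: one (word, 1) pair per word
    return [(w, 1) for w in line.split()]

def wc_reduce(word, counts):
    # word count reduce function: sum the counts for each word
    return sum(counts)

print run_mapreduce(enumerate(["a b a", "b c"]), wc_map, wc_reduce)
# e.g. {'a': 2, 'c': 1, 'b': 2}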

>>> import os
>>> from pydoop.hdfs import hdfs
>>> fs = hdfs("localhost", 9000)
>>> fs.open_file("f",os.O_WRONLY).write(open("f").read())

Figure 3: A compact HDFS usage example. In this case, a local file is copied to HDFS.

Figure 2: Integration of Pydoop with C++. In Pipes, method calls flow from the framework through the C++ and the Pydoop API, ultimately reaching user-defined classes; Python objects are wrapped by Boost.Python and returned to the framework. In the HDFS wrapper, instead, function calls are initiated by Pydoop.

3.2 Hadoop Pipes

Fig. 1 shows the data flows in Hadoop Pipes. Hadoop uses a specialized class of tasks, Pipes tasks, to communicate with user-supplied executables by means of a protocol that uses persistent socket connections to exchange serialized objects. The C++ application provides a factory that is used by the framework to create the various components it needs (Mapper, Reducer, RecordReader, Partitioner, etc.). Almost all Hadoop framework components can be overridden by C++ implementations, but, at a minimum, the factory must provide Mapper and Reducer object creation.
Fig. 2 shows the integration of Pydoop with the C++ code. In the Pipes wrapper, method calls flow from the framework through the C++ code and the Pydoop API, ultimately reaching user-defined classes; Python objects resulting from these calls are returned to the framework wrapped in Boost.Python structures. In the HDFS wrapper, the control flow is inverted: function calls are initiated by Pydoop and translated into their C equivalents by the Boost.Python wrapper; the resulting objects are wrapped back and presented as Python objects to the application level.

4. FEATURES

Pydoop makes it possible to develop full-fledged MapReduce applications with HDFS access. Its key features are:
- access to most MapReduce application components: Mapper, Reducer, RecordReader, RecordWriter, Partitioner;
- access to the context objects passed by the framework, which allow the application to get JobConf parameters, set counters and report its status;
- a programming style similar to that of the Java and C++ APIs: developers define classes that are instantiated by the framework, with methods that are also called by the framework (compare this to the Streaming approach, where the entire key/value stream must be handled manually);
- a CPython implementation: any Python module can be used, either pure Python or a C/C++ extension (this is not possible with Jython);
- HDFS access from Python.


from pydoop.hdfs import hdfs

MB = float(2**20)

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    stats = {}
    root_info = fs.get_path_info(root)
    for info in treewalker(fs, root_info):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

def main(argv):
    fs = hdfs("localhost", 9000)
    root = fs.working_directory()
    for k, v in usage_by_bs(fs, root).iteritems():
        print "%.1f %d" % (k/MB, v)
    fs.close()

Figure 4: A more complex, although somewhat contrived, HDFS example. Here, a directory tree is
walked recursively and statistics on file system usage by block size are built.

4.1 A Simple HDFS Example

One of the strengths of Python is interactivity. Fig. 3 shows an almost one-liner that copies a file from the local file system to HDFS. Of course, in this case the Hadoop HDFS shell equivalent would be more compact, but a full API provides more flexibility.
The Pydoop HDFS interface is written as a Boost.Python wrapper around the C libhdfs, itself a JNI wrapping of the Java code, so it essentially supports the same set of features. We also added a few extensions, such as a readline method for HDFS files, in order to provide the Python user with an interface that is reasonably close to that of standard Python file objects.
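As a further illustration of this file-like interface, the short sketch below reads an HDFS text file line by line through the readline extension; the host, port and file name are placeholders, and the close calls assume the usual file-object semantics.

import os
from pydoop.hdfs import hdfs

fs = hdfs("localhost", 9000)
f = fs.open_file("input.txt", os.O_RDONLY)  # placeholder file name
line = f.readline()
while line:                      # readline returns "" at end of file
    print line.rstrip("\n")      # do something with each record
    line = f.readline()
f.close()
fs.close()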
Fig. 4 shows a more detailed, even though somewhat contrived, example: a script that walks through a directory tree and builds statistics on HDFS usage by block size. This is an example of a useful operation that cannot be performed with the Hadoop HDFS shell and requires only a small amount of Python code.

from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))

Figure 5: The simplest implementation of the classic word count example in Pydoop.

class WordCountMapper(Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.inputWords = context.getCounter("WC", "INPUT_WORDS")

    def map(self, context):
        k = context.getInputKey()
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.inputWords, len(words))

class WordCountReducer(Reducer):

    def __init__(self, context):
        super(WordCountReducer, self).__init__(context)
        context.setStatus("initializing")
        self.outputWords = context.getCounter("WC", "OUTPUT_WORDS")

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))
        context.incrementCounter(self.outputWords, 1)

Figure 6: A word count implementation that includes counters and status updates: through these, the application writer can send information on its progress to the framework.

4.2 A Simple MapReduce Example

Fig. 5 shows the simplest implementation of the classic word count example in Pydoop. All communication with the framework is handled through the context object. Specifically, through the context, Mapper objects get input key/value pairs (in this case the key, equal to the byte offset within the input file, is not needed) and emit intermediate key/value pairs; reducers get intermediate keys along with their associated sets of values and emit output key/value pairs.
Fig. 6 shows how to include counter and status updates in the Mapper and Reducer. Counters are defined by the user: they are usually associated with relevant application parameters (in this case, the mapper counts input words and the reducer counts output words). Status updates are simply arbitrary text messages that the application reports back to the framework. As shown in the code snippet, communication of counter and status updates happens through the context object.
Fig. 7 shows how to implement a RecordReader for the word count application. The RecordReader processes the InputSplit, a raw byte chunk from an input file, and divides it into key/value pairs to be fed to the Mapper. In this example, we show a Python reimplementation of Hadoop's default RecordReader, where keys are byte offsets with respect to the whole file and values are text lines: this RecordReader is therefore not specific to word count (although it is the one that the word count Mapper expects). Note that we are using the readline method we added to Pydoop HDFS file objects.
Finally, Fig. 8 shows how to implement the RecordWriter and Partitioner for the word count application. Again, these are Python versions of the corresponding standard general-purpose Hadoop components, so they are actually not specific to word count. The RecordWriter is responsible for writing key/value pairs to output files: the standard behavior, replicated here, is to write one key/value pair per line, separated by a configurable separator that defaults to the tab character. This example also shows how to use the JobConf object to retrieve configuration parameters: in this case we are reading standard Hadoop parameters, but an application is free to define any number of arbitrary options whose values are read from the application's configuration file (exactly as in C++ Pipes applications). The Partitioner decides how to assign keys to reducers: again, in this example we show the standard way of doing this, by means of a hash function of the key itself. The framework passes the total number of reduce tasks to the partition method as the second argument.
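To illustrate the last point, the following sketch reads a hypothetical application-defined option through the JobConf, using the same jc_configure helpers that appear in Fig. 8; the option name and its default are made up, and the import paths are assumed rather than taken from the paper.

from pydoop.pipes import RecordWriter   # assumed import path
from pydoop.utils import jc_configure   # assumed import path

class MyWriter(RecordWriter):

    def __init__(self, context):
        super(MyWriter, self).__init__(context)
        jc = context.getJobConf()
        # standard Hadoop parameter, exactly as in Fig. 8
        jc_configure(self, jc, "mapred.work.output.dir", "outdir")
        # hypothetical application-specific option with a default value
        jc_configure(self, jc, "myapp.output.header", "header", "")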


class WordCountReader(RecordReader):

    def __init__(self, context):
        super(WordCountReader, self).__init__()
        self.isplit = InputSplit(context.getInputSplit())
        self.host, self.port, self.fpath = split_hdfs_path(
            self.isplit.filename)
        self.fs = hdfs(self.host, self.port)
        self.file = self.fs.open_file(self.fpath, os.O_RDONLY)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # already read by the reader of the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            return (False, "", "")
        key = struct.pack(">q", self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":
            return (False, "", "")
        self.bytes_read += len(record)
        return (True, key, record)

    def getProgress(self):
        return min(float(self.bytes_read) / self.isplit.length, 1.0)

Figure 7: Word count RecordReader example. The RecordReader converts the byte-oriented view of the InputSplit to the record-oriented view needed by the Mapper. Here, we show some code snippets from a plug-in replacement of Hadoop's standard Java LineRecordReader, where keys are byte offsets with respect to the whole file and values (records) are text lines.


class WordCountWriter(RecordWriter):

    def __init__(self, context):
        super(WordCountWriter, self).__init__(context)
        jc = context.getJobConf()
        jc_configure_int(self, jc, "mapred.task.partition", "part")
        jc_configure(self, jc, "mapred.work.output.dir", "outdir")
        jc_configure(self, jc, "mapred.textoutputformat.separator",
                     "sep", "\t")
        outfn = "%s/part-%05d" % (self.outdir, self.part)
        host, port, fpath = split_hdfs_path(outfn)
        self.fs = hdfs(host, port)
        self.file = self.fs.open_file(fpath, os.O_WRONLY)

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class WordCountPartitioner(Partitioner):

    def partition(self, key, numOfReduces):
        reducer_id = (hash(key) & sys.maxint) % numOfReduces
        return reducer_id

Figure 8: RecordWriter and Partitioner examples for word count. The former is responsible for writing key/value pairs to output files, while the latter decides how to assign keys to reducers. Again, these are Python implementations of the corresponding standard components. Note how the JobConf is used to retrieve application parameters.

5. COMPARISON

In the previous sections, we compared Pydoop to the other solutions for writing Hadoop applications in terms of characteristics such as convenience, development speed and flexibility. In this section we present a performance comparison, obtained by running the classic word count example in the different implementations. We chose word count for two main reasons: it is a well-known, representative MapReduce example, and it is simple enough to make comparisons between different languages sufficiently fair.
We ran our tests on a cluster of 48 identical machines, each equipped with two dual-core 1.8 GHz AMD Opterons, 4 GB of RAM and two 160 GB hard disks (one dedicated to the operating system and the other to HDFS and MapReduce temporary local directories), connected through Gigabit Ethernet.
To run the word count example, we generated 20 GB of random text data by sampling from a list of English words for spell checkers [9]. Specifically, we merged the english-words.* lists from the SCOWL package, including levels 10 through 70, for a total of 144,577 words (each word in the final 20 GB database appeared about 15 thousand times). The data generation itself was developed as a Pydoop map-only MapReduce application. The input stream consisted of N lines of text, each containing a single integer that represents the amount of data to be generated. We made the word list available to all machines via the Hadoop Distributed Cache. Word sampling from the list was uniform, while line lengths followed a Gaussian distribution with a mean of 120 characters and unit variance. Since each mapper processes a subset of the input records, N was set to a multiple of the maximum map task capacity (that is, the number of nodes multiplied by the number of concurrent map tasks per node).


Due to the relatively low amount of memory available on the test cluster, and to the fact that only one disk was available for data, we set the maximum number of concurrent tasks (both map and reduce) to two on each node, obtaining a total capacity of 96. Consequently, we configured the data generation application to distribute the random text over 96 files, with an HDFS block size of 128 MB. In all cases, we ran the actual word count application with 192 mappers and 90 reducers. All tests described here were run with Hadoop 0.20.1, Java 1.6.0_05, Python 2.5.2 and Jython 2.2.1 (we did not use Jython 2.5 because it no longer supports jythonc).
Our main goal was to compare Pydoop to the two other main options for writing Python Hadoop applications: Jython and Hadoop Streaming (in the latter case, we used executable Python scripts for the mapper and the reducer). In order to see how Pydoop compares with the two main other-language frameworks, we also compared the Pydoop implementation of word count to the Java and C++ ones.
Since there is no way to add a combiner executable to Streaming (although you can provide a Java one), we ran two separate test sessions: in the first, we ran the official word count example (with the combiner class set equal to the reducer class) in the Java, C++, Pydoop and Jython versions; in the second, we ran word count without a combiner for all implementations included in the previous session, plus Streaming (with the mapper and reducer scripts implemented in Python).
Fig. 9 shows results for the first test session (timings are averaged over five iterations of the same run): Java is the fastest, C++ is second best, and Pydoop and Jython are the slowest (Pydoop yielded slightly better results, but the two are comparable within the error range). The fact that Java has the best performance is not surprising: this is the standard way of writing Hadoop applications, where everything runs inside the framework. The C++ application, on the other hand, communicates with the Java framework through an external socket-based protocol, adding overhead. Pydoop, being in turn built upon the C++ code, obviously adds more overhead. Moreover, in general, C++ code is expected to be much faster than its Python equivalent, even for simple tasks such as this one. The better integration of Jython with the framework is probably counterbalanced by the fact that it is generally slower than CPython.
Fig. 10 shows results for the second test session (again, timings are averaged over five iterations of the same run). For all implementations except Streaming, performance ranks are similar to those of the previous session: again, Jython's and Pydoop's performances are equal within the error range, this time slightly better for the former. The Hadoop Streaming implementation is considerably slower, mostly because of the plain-text serialization of each key/value pair.

class WordCountReducer(Reducer):

    def reduce(self, context):
        context.emit(context.input_key,
                     str(sum(context.itervalues())))

Figure 11: Pydoop word count reducer code written according to the new planned interface.


Figure 9: Total elapsed times, scaled by the average Java elapsed time (238 s), for a word count on 20 GB of data with the Java, C++, Pydoop and Jython implementations. In this case we ran the official example, where the combiner class is set equal to the reducer class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

Figure 10: Total elapsed times, scaled by the average Java elapsed time (338 s), for a word count on 20 GB of data with the Java, C++, Jython, Pydoop and Hadoop Streaming (Python scripts) implementations. In this case we did not set a combiner class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

6. CONCLUSIONS AND FUTURE WORK

Pydoop is a Python MapReduce and HDFS API for Hadoop that allows object-oriented, Java-style MapReduce programming in CPython. It constitutes a valid alternative to the other two main solutions for Python Hadoop programming, Jython and Hadoop Streaming-driven Python scripts. With respect to Jython, Pydoop has the advantage of being a CPython package, which means application writers have access to all Python libraries, either built-in or third-party, including any C/C++ extension module. Performance-wise, there is no significant difference between Pydoop and Jython. With respect to Streaming, there are several advantages: application writing is done through a consistent API that handles communication with the framework through the context object, while in Streaming key/value passing must be handled manually through the standard input and output of the executable mapper and reducer scripts; almost all MapReduce components are accessible, while in Streaming only the map and reduce functions can be written; the text protocol used by Streaming imposes limits on data types; finally, performance with Streaming is considerably worse. Moreover, Pydoop also provides access to HDFS, making it possible to perform nontrivial tasks with the compactness and speed of development of Python.
Although it is already being used internally in production, Pydoop is still under development. One of the most relevant enhancements we plan to add in the near future is a more Pythonic interface to objects and methods, in order to help Python programmers become familiar with it more easily. The features we plan to add include property access to keys and values and Python-style iterators for traversing the set of values associated with a given key. Fig. 11 shows how the word count reducer code will change once the aforementioned features are added.


7. AVAILABILITY

Pydoop is available at http://pydoop.sourceforge.net.

8. ACKNOWLEDGMENTS

The work described here was partially supported by the Italian Ministry of Research under the CYBERSAR project.

9. REFERENCES

[1] Amazon Elastic MapReduce. http://aws.amazon.com/elasticmapreduce.
[2] Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy.
[3] Disco. http://discoproject.org.
[4] Dumbo. http://wiki.github.com/klbostee/dumbo.
[5] Hadoop. http://hadoop.apache.org.
[6] Hadoop + Python = Happy. http://code.google.com/p/happy.
[7] Hadoop Common Credits. http://hadoop.apache.org/common/credits.html.
[8] Hadoop Distributed File System (HDFS) APIs in Perl, Python, Ruby and PHP. http://wiki.apache.org/hadoop/HDFS-APIs.
[9] Kevin's Word List Page. http://wordlist.sourceforge.net.
[10] NumPy. http://numpy.scipy.org.
[11] Octopy: Easy MapReduce for Python. http://code.google.com/p/octopy.
[12] Starfish. http://rufy.com/starfish/doc.
[13] The Jython Project. http://www.jython.org.
[14] Thrift. http://incubator.apache.org/thrift.
[15] D. Abrahams and R. Grosse-Kunstleve. Building hybrid systems with Boost.Python. C/C++ Users Journal, 21(7):29-36, 2003.
[16] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: 6th Symposium on Operating Systems Design and Implementation, 2004.
[17] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29-43, 2003.
[18] S. Leo, P. Anedda, M. Gaggero, and G. Zanetti. Using virtual clusters to decouple computation and data management in high throughput analysis applications. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, Pisa, Italy, 17-19 February 2010, pages 411-415, 2010.
