
Exploring Large-Data Issues in the Curriculum:

A Case Study with MapReduce

Jimmy Lin
The iSchool, College of Information Studies
Laboratory for Computational Linguistics and Information Processing
University of Maryland, College Park
jimmylin@umd.edu

Abstract

This paper describes the design of a pilot research and educational effort at the University of Maryland centered around technologies for tackling Web-scale problems. In the context of a “cloud computing” initiative led by Google and IBM, students and researchers are provided access to a computer cluster running Hadoop, an open-source Java implementation of Google’s MapReduce framework. This technology provides an opportunity for students to explore large-data issues in the context of a course organized around teams of graduate and undergraduate students, in which they tackle open research problems in the human language technologies. This design represents one attempt to bridge traditional instruction with real-world, large-data research challenges.

1 Introduction

Over the past couple of decades, the field of computational linguistics, and more broadly, human language technologies, has seen the emergence and later dominance of empirical techniques and data-driven research. Concomitant with this trend is the requirement that systems and algorithms handle large quantities of data. Banko and Brill (2001) were among the first to demonstrate the importance of dataset size as a significant factor governing prediction accuracy in a supervised machine learning task. In fact, they argue that the size of the training set is perhaps more important than the choice of machine learning algorithm itself. Similarly, experiments in question answering have shown the effectiveness of simple pattern-matching techniques when applied to large quantities of data (Brill et al., 2001). More recently, this line of argumentation has been echoed in experiments with large-scale language models: Brants et al. (2007) show that for statistical machine translation, a simple smoothing method (dubbed Stupid Backoff) approaches the quality of Kneser-Ney smoothing as the amount of training data increases, and with the simple method one can process significantly more data.

Given these observations, it is important to integrate discussions of large-data issues into any course on human language technology. Most existing courses focus on smaller-sized problems and datasets that can be processed on students’ personal computers, leaving students ill-prepared to cope with the vast quantities of data in operational environments. Even when larger datasets are leveraged in the classroom, they are mostly used as static resources. Thus, students experience a disconnect as they transition from a learning environment to one where they work on real-world problems.

Nevertheless, there are at least two major challenges associated with explicit treatment of large-data issues in an HLT curriculum:

• The first concerns resources: it is unclear where one might acquire the hardware to support educational activities, especially if such activities are in direct competition with research.

• The second involves the complexities inherently associated with parallel and distributed processing, currently the only practical solution to large-data problems. For any course, it is difficult to retain focus on HLT-relevant problems, since the exploration of large-data issues necessitates (time-consuming) forays into parallel and distributed computing.
This paper presents a case study that grapples with the issues outlined above. Building on previous experience with similar courses at the University of Washington (Kimball et al., 2008), I present a pilot “cloud computing” course currently underway at the University of Maryland that leverages a collaboration with Google and IBM, through which students are given access to hardware resources. To further alleviate the first issue, research is brought into alignment with education by structuring a team-oriented, project-focused course. The core idea is to organize teams of graduate and undergraduate students focused on tackling open research problems in natural language processing, information retrieval, and related areas. Ph.D. students serve as leaders on projects related to their research, and are given the opportunity to serve as mentors to undergraduate and masters students.

Google’s MapReduce programming framework is an elegant solution to the second issue raised above. By providing a functional abstraction that isolates the programmer from parallel and distributed processing issues, students can focus on solving the actual problem. I first provide the context for this academic–industrial collaboration, and then move on to describe the course setup.

2 Cloud Computing and MapReduce

In October 2007, Google and IBM jointly announced the Academic Cloud Computing Initiative, with the goal of helping both researchers and students address the challenges of “Web-scale” computing. The initiative revolves around Google’s MapReduce programming paradigm (Dean and Ghemawat, 2004), which represents a proven approach to tackling data-intensive problems in a distributed manner. Six universities were involved in the collaboration at the outset: Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley, the University of Maryland, and the University of Washington. I am the lead faculty at the University of Maryland on this project.

As part of this initiative, IBM and Google have dedicated a large cluster of several hundred machines for use by faculty and students at the participating institutions. The cluster takes advantage of Hadoop, an open-source implementation of MapReduce in Java (http://hadoop.apache.org/). By making these resources available, Google and IBM hope to encourage faculty adoption of cloud computing in their research and also integration of the technology into the curriculum.

MapReduce builds on the observation that many information processing tasks have the same basic structure: a computation is applied over a large number of records (e.g., Web pages) to generate partial results, which are then aggregated in some fashion. Naturally, the per-record computation and aggregation function vary according to task, but the basic structure remains fixed. Taking inspiration from higher-order functions in functional programming, MapReduce provides an abstraction at the point of these two operations. Specifically, the programmer defines a “mapper” and a “reducer” with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

Key/value pairs form the basic data structure in MapReduce. The mapper is applied to every input key/value pair to generate an arbitrary number of intermediate key/value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key/value pairs. This two-stage processing structure is illustrated in Figure 1.

Under the framework, a programmer need only provide implementations of the mapper and reducer. On top of a distributed file system (Ghemawat et al., 2003), the runtime transparently handles all other aspects of execution, on clusters ranging from a few to a few thousand nodes. The runtime is responsible for scheduling map and reduce workers on commodity hardware assumed to be unreliable, and thus is tolerant to various faults through a number of error recovery mechanisms. The runtime also manages data distribution, including splitting the input across multiple map workers and the potentially very large sorting problem between the map and reduce phases whereby intermediate key/value pairs must be grouped by key.
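To make the mapper/reducer abstraction concrete, consider the canonical word count example, in which the mapper emits an intermediate pair (term, 1) for every term in its input and the reducer sums the counts associated with each term. The sketch below is my illustration rather than a listing from the course materials; it uses the classic Hadoop API of roughly this vintage (org.apache.hadoop.mapred), and the input and output paths are hypothetical command-line arguments.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Mapper: (k1, v1) = (byte offset, line of text) → [(term, 1)]
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE); // emit an intermediate (term, 1) pair
      }
    }
  }

  // Reducer: (k2, [v2]) = (term, [counts]) → [(term, total)]
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get(); // the runtime has already grouped values by key
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Minimal driver: the runtime handles everything else (input
  // splitting, shuffling and sorting, scheduling, fault tolerance).
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // optional local aggregation
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Note that neither the mapper nor the reducer contains any code for parallelism, data movement, or error handling; this division of labor is precisely the pedagogical advantage discussed below.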
[Figure 1: Illustration of the MapReduce framework: the “mapper” is applied to all input records, which generates results that are aggregated by the “reducer”. The diagram shows input records flowing through parallel map operations into a barrier that groups intermediate values by key, followed by parallel reduce operations that produce the output.]

The biggest advantage of MapReduce from a pedagogical point of view is that it allows an HLT course to retain its focus on applications. Divide-and-conquer algorithms running on multiple machines are currently the only effective strategy for tackling Web-scale problems. However, programming parallel and distributed systems is a difficult topic for students to master. Due to communication and synchronization issues, concurrent operations are notoriously challenging to reason about—unanticipated race conditions are hard to detect and even harder to debug. MapReduce allows the programmer to offload these problems (no doubt important, but irrelevant from the perspective of HLT) onto the runtime, which handles the complexities associated with distributed processing on large clusters. The functional abstraction allows a student to focus on problem solving, not managing the details of error recovery, data distribution, etc.

3 Course Design

This paper describes a “cloud computing” course at the University of Maryland being offered in Spring 2008. The core idea is to assemble small teams of graduate and undergraduate students to tackle research problems, primarily in the areas of information retrieval and natural language processing. Ph.D. students serve as team leaders, overseeing small groups of masters and undergraduates on topics related to their doctoral research. The roles of “team leader” and “team member” are explicitly assigned at the beginning of the semester, and are associated with different expectations and responsibilities. All course material and additional details are available on the course homepage (http://www.umiacs.umd.edu/~jimmylin/cloud-computing/).

3.1 Objectives and Goals

I identified a list of desired competencies for students to acquire and refine throughout the course:

• Understand and be able to articulate the challenges associated with distributed solutions to large-scale problems, e.g., scheduling, load balancing, fault tolerance, memory and bandwidth limitations, etc.

• Understand and be able to explain the concepts behind MapReduce as one framework for addressing the above issues.

• Understand and be able to express well-known algorithms (e.g., PageRank) in the MapReduce framework.

• Understand and be able to reason about engineering tradeoffs in alternative approaches to processing large datasets.

• Gain in-depth experience with one research problem in Web-scale information processing (broadly defined).

With respect to the final bullet point, the students are expected to acquire the following abilities:

• Understand how current solutions to the particular research problem can be cast into the MapReduce framework.

• Be able to explain what advantages the MapReduce framework provides over existing approaches (or disadvantages if a MapReduce formulation turns out to be unsuitable for expressing the problem).

• Articulate how adopting the MapReduce framework can potentially lead to advances in the state of the art by enabling processing not possible before.
I assumed that all students have a strong foundation in computer science, which was operationalized in having completed basic courses in algorithms, data structures, and programming languages (in practice, this was trivially met for the graduate students, who all had undergraduate degrees in computer science). I explicitly made the decision that previous courses in parallel programming, systems, or networks were not required. Finally, prior experience with natural language processing, information retrieval, or related areas was not assumed. However, strong competency in Java programming was a strict requirement, as the Hadoop implementation of MapReduce is based in Java.

In the project-oriented setup, the team leaders (i.e., Ph.D. students) have additional roles to play. One of the goals of the course is to give them experience in mentoring more junior colleagues and managing a team project. As such, they were expected to acquire real-world skills in project organization and management.

3.2 Schedule and Major Components

As designed, the course spans a standard fifteen-week semester, meeting twice a week (Monday and Wednesday) for one hour and fifteen minutes each session. The general setup is shown in Figure 2. As this paper goes to press (mid-April), the course has just concluded Week 11.

[Figure 2: Overview of course schedule. Weeks 1–3: Hadoop boot camp; week 5: project proposal presentations; the middle weeks pair project meetings (Phase I, then Phase II) on Mondays with guest speakers on Wednesdays; weeks 14–15: final project presentations.]

During the first three weeks, all students are immersed in a “Hadoop boot camp”, where they are introduced to the MapReduce programming framework. Material was adapted from slides developed by Christophe Bisciglia and his colleagues from Google, who have delivered similar content in various formats (http://code.google.com/edu/parallel/). As it was assumed that all students had strong foundations in computer science, the pace of the lectures was brisk. The themes of the five boot camp sessions are listed below:

• Introduction to parallel/distributed processing

• From functional programming to MapReduce and the Google File System (GFS)

• “Hello World” MapReduce lab

• Graph algorithms with MapReduce

• Information retrieval with MapReduce

A brief overview of parallel and distributed processing provides a natural transition into abstractions afforded by functional programming, the inspiration behind MapReduce. That in turn provides the context to introduce MapReduce itself, along with the distributed file system upon which it depends. The final two lectures focus on specific case studies of MapReduce applied to graph analysis and information retrieval. The first covers graph search and PageRank, while the second covers algorithms for information retrieval. With the exception of the “Hello World” lab session, all lecture content was delivered at the conceptual level, without specific reference to the Hadoop API and implementation details (see Section 5 for discussion). The boot camp is capped off with a programming exercise (implementation of PageRank) to ensure that students have a passing knowledge of MapReduce concepts in general and the Hadoop API in particular.
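To give a flavor of that capstone exercise, one iteration of simplified PageRank decomposes naturally into the two MapReduce operations: the mapper divides each page’s current rank among its outlinks, and the reducer sums the rank mass arriving at each page. The self-contained sketch below is my illustration, not the actual course assignment; it simulates the map, group-by-key, and reduce steps on a toy in-memory graph, and deliberately ignores the damping factor, dangling nodes, and the need to carry the graph structure itself from one iteration to the next.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine simulation of one simplified PageRank iteration,
// written in the shape of a MapReduce job. Illustrative only.
public class PageRankSketch {
  public static void main(String[] args) {
    // Toy graph: page -> outlinks (a hypothetical three-page web).
    Map<String, List<String>> links = new HashMap<>();
    links.put("A", List.of("B", "C"));
    links.put("B", List.of("C"));
    links.put("C", List.of("A"));

    // Current rank of each page (uniform initialization).
    Map<String, Double> ranks = new HashMap<>();
    for (String page : links.keySet()) ranks.put(page, 1.0 / links.size());

    // Map phase: each page emits (outlink, share of its rank).
    List<Map.Entry<String, Double>> intermediate = new ArrayList<>();
    for (String page : links.keySet()) {
      double share = ranks.get(page) / links.get(page).size();
      for (String target : links.get(page)) {
        intermediate.add(Map.entry(target, share));
      }
    }

    // Shuffle: group intermediate values by key (done transparently
    // by the runtime in a real MapReduce job).
    Map<String, List<Double>> grouped = new HashMap<>();
    for (Map.Entry<String, Double> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
             .add(pair.getValue());
    }

    // Reduce phase: sum the incoming rank mass for each page.
    Map<String, Double> newRanks = new HashMap<>();
    for (Map.Entry<String, List<Double>> entry : grouped.entrySet()) {
      double sum = 0.0;
      for (double v : entry.getValue()) sum += v;
      newRanks.put(entry.getKey(), sum);
    }

    System.out.println(newRanks); // one iteration; repeat until convergence
  }
}

In a real Hadoop job, the intermediate pairs would be emitted through an OutputCollector and the grouping would be performed by the runtime during the shuffle; the point is that the iterative algorithm reduces to exactly these two stateless operations, repeated until the ranks converge.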
Concurrent with the boot camp, team leaders are expected to develop a detailed plan of research: what they hope to accomplish, specific tasks that would lead to the goals, and possible distribution of those tasks across team members. I recommend that each project be structured into two phases: the first phase focusing on how existing solutions might be recast into the MapReduce framework, the second phase focusing on interesting extensions enabled by MapReduce. In addition to the detailed research plan, the leaders are responsible for organizing introductory material (papers, tutorials, etc.), since team members are not expected to have any prior experience with the research topic.

The majority of the course is taken up by the research project itself. The Monday class sessions are devoted to the team project meetings, and the team leader is given discretion on how this is managed. Typical activities include evaluation of deliverables (code, experimental results, etc.) from the previous week and discussions of plans for the upcoming week, but other common uses of the meeting time include whiteboard sessions and code review. During the project meetings I circulate from group to group to track progress, offer helpful suggestions, and contribute substantially if possible.

To the extent practical, the teams adopt standard best practices for software development. Students use Eclipse as the development environment and take advantage of a plug-in that provides a seamless interface to the Hadoop cluster. Code is shared via Subversion, with both project-specific repositories and a course-wide repository for common libraries. A wiki is also provided as a point of collaboration.

Concurrent with the project meetings on Mondays, a speaker series takes place on Wednesdays. Attendance for students is required, but otherwise the talks are open to the public. One of the goals for these invited talks is to build an active community of researchers interested in large datasets and distributed processing. Invited talks can be classified into one of two types: infrastructure-focused and application-focused. Examples of the first include alternative architectures for processing large datasets and dynamic provisioning of computing services. Examples of the second include a survey of distributed data mining techniques and Web-scale sentiment analysis. It is not a requirement for the talks to focus on MapReduce per se—rather, an emphasis on large-data issues is the thread that weaves all these presentations together.

3.3 Student Evaluation

At the beginning of the course, students are assigned specific roles (team leader or team member) and are evaluated according to different criteria (in both grade components and relative weights).

The team leaders are responsible for producing the detailed research plan at the beginning of the semester. The entire team is responsible for three checkpoint deliverables throughout the course: an initial oral presentation outlining their plans, a short interim progress report at roughly the midpoint of the semester, and a final oral presentation accompanied by a written report at the end of the semester.

On a weekly basis, I request from each student a status report delivered as a concise email: a paragraph-length outline of progress from the previous week and plans for the following week. This, coupled with my observations during each project meeting, provides the basis for continuous evaluation of student performance.

4 Course Implementation

Currently, 13 students (7 Ph.D., 3 masters, 3 undergraduates) are involved in the course, working on six different projects. Last fall, as planning was underway, Ph.D. students from the Laboratory for Computational Linguistics and Information Processing at the University of Maryland were recruited as team leaders. Three of them agreed, developing projects around their doctoral research—these represent cases with maximal alignment of research and educational goals. In addition, the availability of this opportunity was announced on mailing lists, which generated substantial interest. Undergraduates were recruited from the Computer Science honors program; since it is a requirement for those students to complete an honors project, this course provided a suitable vehicle for satisfying that requirement.

Three elements are necessary for a successful project: interested students, an interesting research problem of appropriate scope, and the availability of data to support the work. I served as a broker for all three elements, and eventually settled on five projects that satisfied all the desiderata (one project was a later addition). As there was more interest than spaces available for team members, it was possible to screen for suitable background and matching interests. The six ongoing projects are as follows:

• Large-data statistical machine translation

• Construction of large latent-variable language models

• Resolution of name mentions in large email archives
• Network analysis for enhancing biomedical text retrieval

• Text–background separation in children’s picture books

• High-throughput biological sequence alignment and processing

Of the six projects, four fall squarely in the area of human language technology: the first two are typical of problems in natural language processing, while the second two are problems in information retrieval. The final two projects represent attempts to push the boundaries of the MapReduce paradigm, into image processing and computational biology, respectively. Short project descriptions can be found on the course homepage.

5 Pedagogical Discussion

The design of any course is an exercise in tradeoffs, and this pilot project is no exception. In this section, I attempt to justify the course design decisions and discuss possible alternatives.

At the outset, I explicitly decided against a “traditional” course format that would involve carefully-paced delivery of content with structured exercises (e.g., problem sets or labs). Such a design would perhaps be capped off with a multi-week final project. The pioneering MapReduce course at the University of Washington represents an example of this design (Kimball et al., 2008), combining six weeks of standard classroom instruction with an optional four-week final project. As an alternative, I organized my course around the research project. This choice meant that the time devoted to direct instruction on foundational concepts was very limited, i.e., the three-week boot camp.

One consequence of the boot-camp setup is some disconnect between the lecture material and implementation details. Students were expected to rapidly translate high-level concepts into low-level programming constructs and API calls without much guidance. There was only one “hands on” session in the boot camp, focusing on more mundane issues such as installation, configuration, connecting to the server, etc. Although that session also included an overview of a simple Hadoop program, that by no means was sufficient to yield in-depth understanding of the framework.

The intensity of the boot camp was mitigated by the composition of the students. Since students were self-selected and further screened by me in terms of their computational background, they represent the highest caliber of students at the university. Furthermore, due to the novel nature of the material, students were highly motivated to rapidly acquire whatever knowledge was necessary outside the classroom. In reality, the course design forced students to spend the first few weeks of the project simultaneously learning about the research problem and the details of the Hadoop framework. However, this did not appear to be a problem.

Another interesting design choice is the mixing of students with different backgrounds in the same classroom environment. Obviously, the graduate students had stronger computer science backgrounds than the undergraduates overall, and the team leaders had far more experience on the particular research problem than everyone else by design. However, this was less an issue than one would have initially thought, partially due to the selection of the students. Since MapReduce requires a different approach to problem solving, significant learning was required from everyone, independent of prior experience. In fact, prior knowledge of existing solutions may in some cases be limiting, since it precludes a fresh approach to the problem.

6 Course Evaluation

Has the course succeeded? Before this question can be meaningfully answered, one needs to define measures for quantifying success. Note that the evaluation of the course is distinct from the evaluation of student performance (covered in Section 3.3). Given the explicit goal of integrating research and education, I propose the following evaluation criteria:

• Significance of research findings, as measured by the number of publications that arise directly or indirectly from this project.

• Placement of students, e.g., internships and permanent positions, or admission to graduate programs (for undergraduates).

• Number of projects with sustained research activities after the conclusion of the course.
• Amount of additional research support from other funding agencies (NSF, DARPA, etc.) for which the projects provided preliminary results.

Here I provide an interim assessment, as this paper goes to press in mid-April. Preliminary results from the projects have already yielded two separate publications: one on statistical machine translation (Dyer et al., 2008), the other on information retrieval (Elsayed et al., 2008). In terms of student placement, I believe that experience from this course has made several students highly attractive to companies such as Google, Yahoo, and Amazon—both for permanent positions and summer internships. It is far too early to have measurable results with respect to the final two criteria, but otherwise preliminary assessment appears to support the overall success of this course.

In addition to the above discussion, it is also worth mentioning that the course is emerging as a nexus of cloud computing on the Maryland campus (and beyond), serving to connect multiple organizations that share in having large-data problems. Already, the students are drawn from a variety of academic units on campus:

• The iSchool

• Department of Computer Science

• Department of Linguistics

• Department of Geography

and cut across multiple research labs:

• The Institute for Advanced Computer Studies

• The Laboratory for Computational Linguistics and Information Processing

• The Human-Computer Interaction Laboratory

• The Center for Bioinformatics and Computational Biology

Off campus, there are ongoing collaborations with the National Center for Biotechnology Information (NCBI) within the National Library of Medicine (NLM). Other information-based organizations around the Washington, D.C. area have also expressed interest in cloud computing technology.

7 Conclusion

This paper describes the design of an integrated research and educational initiative focused on tackling Web-scale problems in natural language processing and information retrieval using MapReduce. Preliminary assessment indicates that this project represents one viable approach to bridging classroom instruction and real-world research challenges. With the advent of clusters composed of commodity machines and “rent-a-cluster” services such as Amazon’s EC2 (http://aws.amazon.com/ec2), I believe that large-data issues can be practically incorporated into an HLT curriculum at a reasonable cost.

Acknowledgments

I would like to thank IBM and Google for their generous hardware support via the Academic Cloud Computing Initiative. Specifically, thanks go out to Dennis Quan and Eugene Hung from IBM for their tireless support of our efforts. This course would not have been possible without the participation of 13 enthusiastic, dedicated students, with whom I feel blessed to have had the opportunity to work. In alphabetical order, they are: Christiam Camacho, George Caragea, Aaron Cordova, Chris Dyer, Tamer Elsayed, Denis Filimonov, Chang Hu, Greg Jablonski, Alan Jackoway, Punit Mehta, Alexander Mont, Michael Schatz, and Hua Wei. Finally, I would like to thank Esther and Kiri for their kind support.

References

Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867, Prague, Czech Republic.

Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. 2001. Data-intensive question answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), pages 393–400, Gaithersburg, Maryland.

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150, San Francisco, California.

Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. 2008. Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce. In Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008, Columbus, Ohio.

Tamer Elsayed, Jimmy Lin, and Douglas Oard. 2008. Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), Companion Volume, Columbus, Ohio.

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29–43, Bolton Landing, New York.

Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. 2008. Cluster computing for Web-scale data processing. In Proceedings of the 39th ACM Technical Symposium on Computer Science Education (SIGCSE 2008), pages 116–120, Portland, Oregon.
