Jimmy Lin
The iSchool, College of Information Studies
Laboratory for Computational Linguistics and Information Processing
University of Maryland, College Park
jimmylin@umd.edu
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08), pages 54–61,
Columbus, Ohio, USA, June 2008.
© 2008 Association for Computational Linguistics
cult to retain focus on HLT-relevant problems, since the exploration of large-data issues necessitates (time-consuming) forays into parallel and distributed computing.

This paper presents a case study that grapples with the issues outlined above. Building on previous experience with similar courses at the University of Washington (Kimball et al., 2008), I present a pilot “cloud computing” course currently underway at the University of Maryland that leverages a collaboration with Google and IBM, through which students are given access to hardware resources. To further alleviate the first issue, research is brought into alignment with education by structuring a team-oriented, project-focused course. The core idea is to organize teams of graduate and undergraduate students focused on tackling open research problems in natural language processing, information retrieval, and related areas. Ph.D. students serve as leaders on projects related to their research, and are given the opportunity to serve as mentors to undergraduate and master's students.

Google’s MapReduce programming framework is an elegant solution to the second issue raised above. By providing a functional abstraction that isolates the programmer from parallel and distributed processing issues, it lets students focus on solving the actual problem. I first provide the context for this academic–industrial collaboration, and then move on to describe the course setup.

2 Cloud Computing and MapReduce

In October 2007, Google and IBM jointly announced the Academic Cloud Computing Initiative, with the goal of helping both researchers and students address the challenges of “Web-scale” computing. The initiative revolves around Google’s MapReduce programming paradigm (Dean and Ghemawat, 2004), which represents a proven approach to tackling data-intensive problems in a distributed manner. Six universities were involved in the collaboration at the outset: Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley, the University of Maryland, and the University of Washington. I am the lead faculty at the University of Maryland on this project.

As part of this initiative, IBM and Google have dedicated a large cluster of several hundred machines for use by faculty and students at the participating institutions. The cluster takes advantage of Hadoop, an open-source implementation of MapReduce in Java.1 By making these resources available, Google and IBM hope to encourage faculty adoption of cloud computing in their research and also integration of the technology into the curriculum.

1 http://hadoop.apache.org/

MapReduce builds on the observation that many information processing tasks have the same basic structure: a computation is applied over a large number of records (e.g., Web pages) to generate partial results, which are then aggregated in some fashion. Naturally, the per-record computation and aggregation function vary according to task, but the basic structure remains fixed. Taking inspiration from higher-order functions in functional programming, MapReduce provides an abstraction at the point of these two operations. Specifically, the programmer defines a “mapper” and a “reducer” with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

Key/value pairs form the basic data structure in MapReduce. The mapper is applied to every input key/value pair to generate an arbitrary number of intermediate key/value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key/value pairs. This two-stage processing structure is illustrated in Figure 1.

Under the framework, a programmer need only provide implementations of the mapper and reducer. On top of a distributed file system (Ghemawat et al., 2003), the runtime transparently handles all other aspects of execution, on clusters ranging from a few to a few thousand nodes. The runtime is responsible for scheduling map and reduce workers on commodity hardware assumed to be unreliable, and thus is tolerant to various faults through a number of error recovery mechanisms. The runtime also manages data distribution, including splitting the input across multiple map workers and the potentially very large sorting problem between the map and reduce phases whereby intermediate key/value pairs must be grouped by key.
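The mapper and reducer signatures described in this section can be illustrated with a minimal single-process sketch in Python. This is not the Hadoop API and not code from the course; the names `word_count_mapper`, `word_count_reducer`, and `run_mapreduce` are hypothetical, and the in-memory grouping step only stands in for the distributed sort that a real runtime performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Signatures mirrored from the text (all names here are illustrative):
#   map:    (k1, v1)   -> [(k2, v2)]
#   reduce: (k2, [v2]) -> [(k3, v3)]


def word_count_mapper(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair for every token in a single input record."""
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """Aggregate all partial counts that share the same intermediate key."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Iterable[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Toy stand-in for the runtime: map, group by key, then reduce."""
    # Map phase: apply the mapper to every input key/value pair.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Grouping by key replaces the distributed sort/shuffle.
            grouped[k2].append(v2)

    # Reduce phase: apply the reducer once per intermediate key,
    # visiting keys in sorted order.
    output: List[Tuple[str, int]] = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output


if __name__ == "__main__":
    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
    for word, count in run_mapreduce(docs, word_count_mapper, word_count_reducer):
        print(word, count)
```

In this sketch only the two small functions are task-specific; scheduling, fault tolerance, and data distribution, which the text attributes to the runtime, are deliberately absent.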
at the beginning of the semester, and are associated with different expectations and responsibilities. All course material and additional details are available on the course homepage.2

[Figure 1: MapReduce processing diagram; surviving labels: input, map]
introduced to the MapReduce programming framework. Material was adapted from slides developed by Christophe Bisciglia and his colleagues from Google, who have delivered similar content in various formats.3 As it was assumed that all students had strong foundations in computer science, the pace of the lectures was brisk. The themes of the five boot camp sessions are listed below:

• Introduction to parallel/distributed processing
• From functional programming to MapReduce and the Google File System (GFS)
• “Hello World” MapReduce lab
• Graph algorithms with MapReduce
• Information retrieval with MapReduce

[Course schedule table: columns Week (1–15), Monday, Wednesday; entries in order of appearance: Hadoop Boot Camp, Proposal Presentations, Project Meetings: Phase I, Guest Speakers, Project Meetings: Phase II, Final Project Presentations]
plan, the leaders are responsible for organizing introductory material (papers, tutorials, etc.) since team members are not expected to have any prior experience with the research topic.

The majority of the course is taken up by the research project itself. The Monday class sessions are devoted to the team project meetings, and the team leader is given discretion on how this is managed. Typical activities include evaluation of deliverables (code, experimental results, etc.) from the previous week and discussions of plans for the upcoming week, but other common uses of the meeting time include whiteboard sessions and code review. During the project meetings I circulate from group to group to track progress, offer helpful suggestions, and contribute substantially if possible.

To the extent practical, the teams adopt standard best practices for software development. Students use Eclipse as the development environment and take advantage of a plug-in that provides a seamless interface to the Hadoop cluster. Code is shared via Subversion, with both project-specific repositories and a course-wide repository for common libraries. A wiki is also provided as a point of collaboration.

Concurrent with the project meetings on Mondays, a speaker series takes place on Wednesdays. Attendance for students is required, but otherwise the talks are open to the public. One of the goals for these invited talks is to build an active community of researchers interested in large datasets and distributed processing. Invited talks can be classified into one of two types: infrastructure-focused and application-focused. Examples of the first include alternative architectures for processing large datasets and dynamic provisioning of computing services. Examples of the second include a survey of distributed data mining techniques and Web-scale sentiment analysis. It is not a requirement for the talks to focus on MapReduce per se; rather, an emphasis on large-data issues is the thread that weaves all these presentations together.

3.3 Student Evaluation

At the beginning of the course, students are assigned specific roles (team leader or team member) and are evaluated according to different criteria (both in grade components and relative weights).

The team leaders are responsible for producing the detailed research plan at the beginning of the semester. The entire team is responsible for three checkpoint deliverables throughout the course: an initial oral presentation outlining their plans, a short interim progress report at roughly the midpoint of the semester, and a final oral presentation accompanied by a written report at the end of the semester.

On a weekly basis, I request from each student a status report delivered as a concise email: a paragraph-length outline of progress from the previous week and plans for the following week. This, coupled with my observations during each project meeting, provides the basis for continuous evaluation of student performance.

4 Course Implementation

Currently, 13 students (7 Ph.D., 3 master's, 3 undergraduates) are involved in the course, working on six different projects. Last fall, as planning was underway, Ph.D. students from the Laboratory for Computational Linguistics and Information Processing at the University of Maryland were recruited as team leaders. Three of them agreed, developing projects around their doctoral research; these represent cases with maximal alignment of research and educational goals. In addition, the availability of this opportunity was announced on mailing lists, which generated substantial interest. Undergraduates were recruited from the Computer Science honors program; since it is a requirement for those students to complete an honors project, this course provided a suitable vehicle for satisfying that requirement.

Three elements are necessary for a successful project: interested students, an interesting research problem of appropriate scope, and the availability of data to support the work. I served as a broker for all three elements, and eventually settled on five projects that satisfied all the desiderata (one project was a later addition). As there was more interest than spaces available for team members, it was possible to screen for suitable background and matching interests. The six ongoing projects are as follows:

• Large-data statistical machine translation
• Construction of large latent-variable language models
• Resolution of name mentions in large email archives
• Network analysis for enhancing biomedical text retrieval
• Text-background separation in children’s picture books
• High-throughput biological sequence alignment and processing

Of the six projects, four fall squarely in the area of human language technology: the first two are typical of problems in natural language processing, while the second two are problems in information retrieval. The final two projects represent attempts to push the boundaries of the MapReduce paradigm, into image processing and computational biology, respectively. Short project descriptions can be found on the course homepage.

5 Pedagogical Discussion

The design of any course is an exercise in tradeoffs, and this pilot project is no exception. In this section, I will attempt to justify course design decisions and discuss possible alternatives.

At the outset, I explicitly decided against a “traditional” course format that would involve carefully-paced delivery of content with structured exercises (e.g., problem sets or labs). Such a design would perhaps be capped off with a multi-week final project. The pioneering MapReduce course at the University of Washington represents an example of this design (Kimball et al., 2008), combining six weeks of standard classroom instruction with an optional four-week final project. As an alternative, I organized my course around the research project. This choice meant that the time devoted to direct instruction on foundational concepts was very limited, i.e., the three-week boot camp.

One consequence of the boot-camp setup is some disconnect between the lecture material and implementation details. Students were expected to rapidly translate high-level concepts into low-level programming constructs and API calls without much guidance. There was only one “hands on” session in the boot camp, focusing on more mundane issues such as installation, configuration, connecting to the server, etc. Although that session also included an overview of a simple Hadoop program, that by no means was sufficient to yield in-depth understanding of the framework.

The intensity of the boot camp was mitigated by the composition of the students. Since students were self-selected and further screened by me in terms of their computational background, they represent the highest caliber of students at the university. Furthermore, due to the novel nature of the material, students were highly motivated to rapidly acquire whatever knowledge was necessary outside the classroom. In reality, the course design forced students to spend the first few weeks of the project simultaneously learning about the research problem and the details of the Hadoop framework. However, this did not appear to be a problem.

Another interesting design choice is the mixing of students with different backgrounds in the same classroom environment. Obviously, the graduate students had stronger computer science backgrounds than the undergraduates overall, and the team leaders had far more experience on the particular research problem than everyone else by design. However, this was less of an issue than one would have initially thought, partially due to the selection of the students. Since MapReduce requires a different approach to problem solving, significant learning was required from everyone, independent of prior experience. In fact, prior knowledge of existing solutions may in some cases be limiting, since it precludes a fresh approach to the problem.

6 Course Evaluation

Has the course succeeded? Before this question can be meaningfully answered, one needs to define measures for quantifying success. Note that the evaluation of the course is distinct from the evaluation of student performance (covered in Section 3.3). Given the explicit goal of integrating research and education, I propose the following evaluation criteria:

• Significance of research findings, as measured by the number of publications that arise directly or indirectly from this project.
• Placement of students, e.g., internships and permanent positions, or admission to graduate programs (for undergraduates).
• Number of projects with sustained research activities after the conclusion of the course.
• Amount of additional research support from other funding agencies (NSF, DARPA, etc.) for which the projects provided preliminary results.

Here I provide an interim assessment, as this paper goes to press in mid-April. Preliminary results from the projects have already yielded two separate publications: one on statistical machine translation (Dyer et al., 2008), the other on information retrieval (Elsayed et al., 2008). In terms of student placement, I believe that experience from this course has made several students highly attractive to companies such as Google, Yahoo, and Amazon, both for permanent positions and summer internships. It is far too early to have measurable results with respect to the final two criteria, but otherwise preliminary assessment appears to support the overall success of this course.

In addition to the above discussion, it is also worth mentioning that the course is emerging as a nexus of cloud computing on the Maryland campus (and beyond), serving to connect multiple organizations that share in having large-data problems. Already, the students are drawn from a variety of academic units on campus:

• The iSchool
• Department of Computer Science
• Department of Linguistics
• Department of Geography

They also cross-cut multiple research labs:

• The Institute for Advanced Computer Studies
• The Laboratory for Computational Linguistics and Information Processing
• The Human-Computer Interaction Laboratory
• The Center for Bioinformatics and Computational Biology

Off campus, there are ongoing collaborations with the National Center for Biotechnology Information (NCBI) within the National Library of Medicine (NLM). Other information-based organizations around the Washington, D.C. area have also expressed interest in cloud computing technology.

7 Conclusion

This paper describes the design of an integrated research and educational initiative focused on tackling Web-scale problems in natural language processing and information retrieval using MapReduce. Preliminary assessment indicates that this project represents one viable approach to bridging classroom instruction and real-world research challenges. With the advent of clusters composed of commodity machines and “rent-a-cluster” services such as Amazon’s EC2,4 I believe that large-data issues can be practically incorporated into an HLT curriculum at a reasonable cost.

4 http://aws.amazon.com/ec2

Acknowledgments

I would like to thank IBM and Google for their generous hardware support via the Academic Cloud Computing Initiative. Specifically, thanks go out to Dennis Quan and Eugene Hung from IBM for their tireless support of our efforts. This course would not have been possible without the participation of 13 enthusiastic, dedicated students, with whom I feel blessed to have had the opportunity to work. In alphabetical order, they are: Christiam Camacho, George Caragea, Aaron Cordova, Chris Dyer, Tamer Elsayed, Denis Filimonov, Chang Hu, Greg Jablonski, Alan Jackoway, Punit Mehta, Alexander Mont, Michael Schatz, and Hua Wei. Finally, I would like to thank Esther and Kiri for their kind support.

References

Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867, Prague, Czech Republic.

Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. 2001. Data-intensive question answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), pages 393–400, Gaithersburg, Maryland.

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150, San Francisco, California.

Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. 2008. Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce. In Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008, Columbus, Ohio.

Tamer Elsayed, Jimmy Lin, and Douglas Oard. 2008. Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), Companion Volume, Columbus, Ohio.

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29–43, Bolton Landing, New York.

Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. 2008. Cluster computing for Web-scale data processing. In Proceedings of the 39th ACM Technical Symposium on Computer Science Education (SIGCSE 2008), pages 116–120, Portland, Oregon.