ADVANCES IN COMPUTERS
Information Security
EDITED BY
MARVIN V. ZELKOWITZ
Department of Computer Science
and Institute for Advanced Computer Studies
University of Maryland
College Park, Maryland
VOLUME 60
ISBN: 0-12-012160-3
ISSN (Series): 0065-2458
∞ The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
CONTRIBUTORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
PREFACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Licensing and Certification of Software Professionals
Donald J. Bagert
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Licensing of Software Engineers . . . . . . . . . . . . . . . . . . . . . . . 6
3. The Certification of Software Developers . . . . . . . . . . . . . . . . . . 27
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Cognitive Hacking
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2. Examples of Cognitive Hacking . . . . . . . . . . . . . . . . . . . . . . . 44
3. Economic and Digital Government Issues Related to Cognitive Hacking . 53
4. Legal Issues Related to Cognitive Hacking . . . . . . . . . . . . . . . . . 57
5. Cognitive Hacking Countermeasures . . . . . . . . . . . . . . . . . . . . 60
6. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7. Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Warren Harrison
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2. Digital Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3. The Forensics Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4. An Illustrative Case Study: Credit Card Fraud . . . . . . . . . . . . . . . 110
5. Law Enforcement and Digital Forensics . . . . . . . . . . . . . . . . . . . 115
6. Organizational Structures of Digital Forensics Capabilities . . . . . . . . 116
7. Research Issues in Digital Forensics . . . . . . . . . . . . . . . . . . . . . 117
8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Crispin Cowan
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2. The Problem: Combining Reliability and Security . . . . . . . . . . . . . 122
3. Survivability Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4. Evaluating Survivability . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Smart Cards
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
2. A Typical Smart Card Transaction . . . . . . . . . . . . . . . . . . . . . . 154
3. Smart Card Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4. Smart Card Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5. Associated Access Technologies . . . . . . . . . . . . . . . . . . . . . . . 173
6. Smart Card Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7. Future Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Glossary of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Mihai Pop
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
2. Shotgun Sequencing Overview . . . . . . . . . . . . . . . . . . . . . . . . 196
3. Assembly Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4. Assembly Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5. Exotic Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
2. Front End Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . 251
3. The Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4. Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
5. Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6. Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7. Performance Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Contributors
Michael Picheny is the Manager of the Speech and Language Algorithms Group in
the Human Language Technologies Group at the IBM TJ Watson Research Center.
Michael has worked in the Speech Recognition area since 1981, joining IBM after
finishing his doctorate at MIT. He has been heavily involved in the development of al-
most all of IBM’s recognition systems, ranging from the world’s first real-time large
vocabulary discrete system through IBM’s current ViaVoice product line. Michael
served as an Associate Editor of the IEEE Transactions on Acoustics, Speech, and
Signal Processing from 1986–1989, is currently a member of the Speech Technical
Committee of the IEEE Signal Processing Society and its representative to the Signal
Processing Society conference board, and is a Fellow of the IEEE.
J. Drew Procaccino is an Assistant Professor of Computer Information Systems in
the College of Business Administration at Rider University (Lawrenceville, NJ). His
teaching interests include office productivity software, systems analysis and design,
systems development and database design. His research interests include electronic
commerce, software engineering, biometrics and smart card technology.
Katherine M. Shelfer is Assistant Professor of Competitive Intelligence, College of
Information Science and Technology, Drexel University in Philadelphia, PA, where
she also directs the Competitive Intelligence Certificate Program. Her research and
teaching interests include competitive intelligence, sources of business information
and the strategic implications of information systems and services, specifically smart
card systems. Dr. Shelfer has published work on smart cards in Communications
of the ACM, the Defense Intelligence Journal and Knowledge Management for the
Information Professional, among others.
Paul Thompson is a senior research engineer at the Institute for Security Technology
Studies and Thayer School of Engineering at Dartmouth College. His research inter-
ests include document retrieval, information extraction, and computer security. He
received a Ph.D. in library and information studies from the University of California,
Berkeley.
Geoffrey Zweig is the Manager of Advanced LVCSR Research at the IBM TJ Wat-
son Research Center. He received a B.A. degree in Physics with highest honors in
1985, and a Ph.D. in Computer Science in 1998, from The University of California
at Berkeley. Following his thesis on the application of Bayesian Networks to ASR,
Geoffrey joined IBM in 1998. In 2001, he co-authored the Graphical Models Toolkit
for speech recognition. He participated in the 2001 DARPA-sponsored HUB-5 eval-
uations, and again in the 2003 EARS Rich Transcription evaluation. Geoffrey is cur-
rently a member of the IEEE, and Associate Editor of IEEE Transactions on Speech
and Audio Processing. His research interests include the development of multi-scale
acoustic models, the application of machine learning techniques to speech recogni-
tion, and reliable, efficient decoding techniques.
Preface
Advances in Computers is the oldest series to provide an annual update to the con-
tinuously changing information technology field. It has been continually published
since 1960. Within each volume are usually six to eight chapters describing new de-
velopments in software, hardware, or uses of computers. In this 60th volume of the
series, subtitled Information Security, the focus of most of the chapters is on changes
to the information technology landscape involving security issues. With the growing
ubiquity of the Internet and its increasing importance in everyday life, the need
to address computer security issues continues to grow. The first five chapters describe aspects
of this information security problem. The final two chapters present other topics of
great interest and importance today—genome sequencing and speech recognition.
In Chapter 1, “Licensing and certification of software professionals,” Professor
Donald J. Bagert discusses the current controversy of certifying software profession-
als. Should software engineers be licensed? What does that mean? What is the body
of knowledge that defines what a software professional should know? Should edu-
cational programs be accredited like most engineering programs? All of these are
hotly debated today, and given the impact that computer software has on the world’s
economy, some resolution to these issues must be forthcoming.
Any user of computers today should understand the danger that viruses, worms,
and Trojan horses pose to the integrity of their computer system. Most attacks are
known after they occur and after the damage has already been done. But in Chapter 2,
“Cognitive Hacking” by George Cybenko, Annarita Giani, and Paul Thompson, the
authors discuss a different form of attack where neither hardware nor software is nec-
essarily corrupted. Rather the computer system is used to influence people’s percep-
tions and behavior through misinformation. For example, anyone can post anything
on a Web page with few limitations. The issue is how to deal with false information
on the Web and how to decide whether a source is reliable. This and related topics
are the focus of this chapter.
Most people, criminals included, store information on computers. After a crime
has been committed and a suspect arrested, how do you provide evidence in a court
of law that an illegal action did occur and that the suspect was indeed responsible?
This is the domain of computer forensics. In Chapter 3, Warren Harrison discusses
Marvin Zelkowitz
University of Maryland,
College Park, MD, USA
Licensing and Certification of Software
Professionals
DONALD J. BAGERT
Rose-Hulman Institute of Technology
5500 Wabash Avenue, CM97
Terre Haute, IN, 47803-3999
USA
Don.Bagert@rose-hulman.edu
Abstract
For many years, software organizations have needed to hire developers with a
wide range of academic and professional qualifications, due to the ongoing short-
age of individuals qualified to create and maintain the products required to sat-
isfy marketplace demand. Many of these companies have used the certification
credentials of such individuals to help judge whether they have the proper back-
ground for the development requirements of their particular software organiza-
tion. Certification is a voluntary process intended to document the achievement
of some level of skill or capability. Such certification can be awarded through
a variety of organizations. To date, company-based certification programs have
been dominant in the software field. These programs have been created and run
by a particular company, and are usually centered on determining an individ-
ual’s qualification to use a particular type of software that is marketed by that
business. However, these programs are often limited in scope, and sometimes
make it possible to acquire certification with little practical software develop-
ment background or formal training.
More recently, however, there have been a growing number of efforts to provide
more comprehensive certification programs for software professionals through
professional societies and independent organizations. Some such certifications
are offered as specializations in areas that are part of the product development
process in a number of fields, e.g., quality assurance and project management. In
other cases, there are programs intended to certify individuals for having general
knowledge and abilities across a wide range of software development areas. In
some countries, such certification of software engineering professionals is done
on a nationwide basis by an engineering professional society.
There has also been an increased interest in the licensing of software engi-
neering professionals. Licensing is a more formal version of certification that involves a
government-sanctioned or government-specified process, with the health, safety and welfare
of the public in mind.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Area of Competency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Renewal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Licensing of Software Engineers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. The Nature and Development of Software Engineering . . . . . . . . . . . . . 6
2.2. The Guide to the Software Engineering Body of Knowledge (SWEBOK) . . . 9
2.3. Software Engineering Degree Programs and Accreditation . . . . . . . . . . . 13
2.4. Legal Issues in Professional Licensing . . . . . . . . . . . . . . . . . . . . . . . 17
2.5. Pros and Cons of Licensing Software Engineers . . . . . . . . . . . . . . . . . 19
2.6. Examples of Licensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7. Examples of National Certification . . . . . . . . . . . . . . . . . . . . . . . . 25
3. The Certification of Software Developers . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1. Institute-Based Certification Programs . . . . . . . . . . . . . . . . . . . . . . 27
3.2. Company-Based Certification Programs . . . . . . . . . . . . . . . . . . . . . . 29
3.3. Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 29
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1. Introduction
1.1 Overview
The number of software professionals in the workforce is large and growing. In
the United States, a December 2001 study by the Department of Labor [39] stated
that the number of people employed as software engineers in the United States was
697,000, and it was expected that there would be 1.36 million software engineering
jobs available in the U.S. by 2010. (These numbers are in addition to computer pro-
grammers and other information technology-related positions, which numbered over
2.2 million in 2000 and are projected to grow to over 3.5 million by 2010.) In addi-
tion, these workers come from a wide variety of educational and work experiences.
Since there is such a large and diverse pool of potential workers, many employers
will look to see if job applicants possess any professional certifications or licenses as
one factor in evaluating their credentials.
Certification is a voluntary process intended to document the achievement of some
level of skill or capability. This type of certification can be given through a variety
of organizations, such as companies, professional societies, or institutes whose primary
function is to award such credentials. To date, company-based certification programs
have been dominant in the software field. Such programs have been created and run
by a particular company (such as Microsoft), and are usually centered on determin-
ing an individual’s qualification to write programs or otherwise use software that is
marketed by that business. Therefore, company-based certification is by definition
usually limited in scope, and in some cases can be acquired with little practical
software development experience and no formal education. The Microsoft Certified
Solution Developer (MCSD) program is the Microsoft offering most closely related to
software engineering, and so will be discussed in more detail later in this
article.
However, due to the aforementioned large and growing size of the software engi-
neering community, there have recently been a growing number of efforts to provide
more comprehensive certification programs for software professionals through pro-
fessional societies and independent organizations. Some such institute-based certi-
fications are offered as specializations in areas that are part of the product develop-
ment process in a number of fields, e.g., the Software Quality Engineer certifica-
tion provided by the American Society for Quality (ASQ); others, such as the IEEE
Certified Software Development Professional (CSDP) program, are intended to cer-
tify individuals for having general knowledge and abilities across a wide range of
software development areas.
There has also been an increased interest in the licensing of software engineering
professionals, especially in North America. Licensing is a more formal version of
certification that involves a government-sanctioned or government-specified process,
with the health, safety and welfare of the public in mind. Since engineering is a field
where licensing is commonplace in many countries, most of this effort has focused
on the licensing of software engineers. However, while licensing is commonplace in
professions such as law and medicine, until recently it has been virtually unknown
in the information technology (IT) field. A number of IT professionals have raised
a variety of concerns about the licensing of software engineers, including issues re-
lated to liability, the body of knowledge upon which to base examinations, and the
appropriateness of the engineering model for such licensing.
Some countries either implicitly or explicitly designate that nation’s primary engi-
neering professional society to certify engineers. Such national certification of pro-
fessional engineers is done through a process similar to licensing in the United States,
except that national certification (as the name implies) is a voluntary process.
This article will examine the various licensing and certification initiatives, includ-
ing the history of their development, the process and content of such programs, and
the arguments both for and against licensing and certification. The remainder of Sec-
tion 1 looks at how to view the area of competency for which someone is being certified
or licensed, and provides an outline of the steps commonly required in a certifica-
tion or licensing process. Section 2 will examine licensing and national certification;
Section 3 looks at institute-based and company-based certification; and the final sec-
tion will make some conclusions and outline some possible future directions for the
licensing and certification of software professionals.
high schools and two-year colleges in the United States offer courses to help students
prepare for particular company-based certification exams [6].
Most professions also have a code of ethics and professional conduct as defined by
either a related professional society or a legal authority. The two major U.S. comput-
ing professional societies have developed such a code for software engineering (see
Section 2.1). Most professional licensing jurisdictions will specify a code of ethics
and professional conduct required of all licensees. An organization which manages
a certification program usually has a code of ethics which applicants agree to adhere
to as part of their certification.
1.3 Procedure
One or more of the following pieces of information is commonly used in the as-
sessment of an application for certification or a license:
• Educational background
• Work experience and professional references
• Examinations passed
As previously stated, most professional licenses require a related baccalaureate de-
gree. Some certifications (such as CSDP) require a particular formal educational
background, while others (such as the Microsoft certifications) do not.
Most engineering license boards require four or more years of engineering expe-
rience, preferably under the supervision of a licensed engineer. They also require
references from professional engineers who have had an opportunity to observe the
applicant’s work. The CSDP requires 9000 hours of software development experi-
ence, but no accompanying references. The Microsoft certification programs recom-
mend some practical experience before attempting their examinations, but do not
require it (or any references).
Most certification programs have a person apply before administering any exam-
inations. In the U.S., the application for a state engineering license is submitted af-
ter the applicant has passed two nationally-administered examinations. In
Texas, however, there is a rule which allows the waiver of such exams with addi-
tional experience and references (see Section 2.5). A test on that state's engineering
practice laws is also usually required for licensure.
There is also an application fee, and there may be separate examination fees.
1.4 Renewal
Most certification programs are for a limited time, and require some type of re-
certification mechanism. Some licensing boards only require the payment of annual fees.
1.5 Summary
Table I summarizes the different aspects of licensing, and the different types of
certification.
At any rate, since “software engineering” is firmly entrenched in the lexicon, the
ramifications of such terminology need to be addressed. At first, software engineer-
ing was considered a specialization; however, in the last twenty years it has been
increasingly regarded as a separate discipline and profession. This in turn has po-
tential ramifications for licensing and certification. Frailey [23] asserts that four
facts need to be established in order to justify the licensing or certification of
software engineers:
1. That software engineering is a separate discipline,
2. That software engineering is a profession,
3. That software engineering is sufficiently established to justify certification or
licensing, and
4. That certification or licensing of software engineering would be beneficial
enough to justify the effort to establish them.
In the United Kingdom, the engineering and computing communities came to the
conclusion over a decade ago that these facts were indeed established, and thus began
creating undergraduate software engineering degree programs, and bestowing Char-
tered Engineer status on qualified individuals in the field. However, in the United
States and other countries, the process has been somewhat slower, and even today
there are many in the global engineering and computing communities who feel that
software engineering is not a separate discipline, and even if it is, that it is not engineering.
An important step came in 1993 with the creation of the ad hoc Joint IEEE Com-
puter Society and ACM Steering Committee for the Establishment of Software En-
gineering as a Profession (http://www.computer.org/tab/seprof). Although the Asso-
ciation for Computing Machinery (ACM) and the Computer Society of the Institute
of Electrical and Electronics Engineers, Inc. (IEEE-CS) are both based in the United
States, they each have a significant international component, and consider themselves
as representing the computing community worldwide.
The mission statement of the Joint Steering Committee was “To establish the ap-
propriate set(s) of criteria and norms for professional practice of software engi-
neering upon which industrial decisions, professional certification, and educational
curricula can be based”. They established three task forces: Ethics and Professional
Practices, Body of Knowledge and Recommended Practices, and Education. (Note
that these correspond to the essential components for licensing and certification de-
scribed in Section 1.2.)
Work by the ethics task force proceeded quickly, and the Software Engineering Code
of Ethics and Professional Practice was approved by both ACM and the Computer
Society in 1999 [25]. The body of knowledge task force did a “pilot survey” of
software engineers to gather some initial data concerning the body of knowledge
[20]. It was apparent from the creation of this survey that a volunteer task force would
not have the resources required to properly compile the software engineering body
of knowledge, which led to the creation of the Guide to the Software Engineering
Body of Knowledge (SWEBOK) project in 1997. (SWEBOK will be discussed in
detail in Section 2.2.)
Since the education task force’s objective was to develop Software Engineering
Education recommendations based on the work of the body of knowledge task force,
most of their work would be delayed until there was further definition of the software
engineering BOK. In the meantime, accreditation efforts went forward in Australia,
Canada and the United States (see Section 2.3).
With the recognition that the ad hoc joint steering committee would need
to be an ongoing effort, in late 1998 ACM and IEEE-CS voted to replace
that group with the Software Engineering Coordinating Committee (SWEcc or
SWECC), which would be a standing committee of the two societies (homepage
http://computer.org/tab/swecc.htm). The mission of SWECC was similar to that of
its predecessor, except that its structure would be different, in that instead of taking
on projects itself, it would coordinate various software engineering-related projects
approved and funded by the two societies. So, the new SWEBOK project reported
to SWECC, as did the Software Engineering Education Project (SWEEP), which
was a continuation of the Education Task Force of the ad hoc committee.
All seemed to be progressing well—except that many (especially in ACM) were
very concerned that all the elements required for the licensing of software engineers
were coming into place, as will be discussed in Section 2.5. This eventually led to
the effective dissolution of SWECC.
competent, capable software engineers should be equipped with this knowledge for
potential application.”
Furthermore, it was intended (through a recommendation by the SWEBOK Indus-
trial Advisory Board) that this generally accepted knowledge would be appropriate
in the study material for a software engineering licensing examination that graduates
would take after gaining four years of work experience.
Version 1.0 of the SWEBOK Guide classifies the information compiled on soft-
ware engineering using ten knowledge areas (KA), as shown in Table II. Each KA
was authored by a leading expert in that particular area. The Guide also identifies
seven disciplines related to (but not part of) software engineering, such as computer
science and project management (Table III).
The resulting SWEBOK guide is one that reads somewhat like the very popular
software engineering textbooks written by Pressman [34] and Sommerville [36], but
has two major differences: it is intended for a different audience (practitioners rather
than college students), and is the result of a more widespread review and consensus,
as opposed to the vision of a single author.
TABLE II
SWEBOK KNOWLEDGE AREAS
Software Requirements
Software Design
Software Construction
Software Testing
Software Maintenance
Software Configuration Management
Software Engineering Management
Software Engineering Process
Software Engineering Tools and Methods
Software Quality
TABLE III
SWEBOK RELATED DISCIPLINES
Computer Science
Mathematics
Project Management
Computer Engineering
Systems Engineering
Management and Management Science
Cognitive Science and Human Factors
The report goes on to recommend that the ACM Council, the society’s governing
board, withdraw ACM from the SWEBOK effort, which they did, in June 2000 (more
details are provided in Section 2.5).
Another prominent critic of the SWEBOK Guide is Cem Kaner. Dr. Kaner is a
Professor of Computer Science at the Florida Institute of Technology as well as a
lawyer, author of a book on software testing, and consultant. In [29], he states some
of his concerns about SWEBOK from a legal point-of-view:
“The SWEBOK unconditionally endorses the IEEE Standard 829 for software test
documentation. . . We are not aware of scientific research that demonstrates that Stan-
dard 829 is a good method, or a better method than others, or desirable under studied
circumstances. Standard 829 is in the Body of Knowledge because it won a popu-
larity contest—it was endorsed (or not strongly enough opposed) by the authors of
SWEBOK and those relatively few [less than 500 total] people who chose to partici-
pate in the SWEBOK drafting and review process so far. . . What is the consequence?
A software engineer who recommends against the use of Standard 829 puts herself
at [legal] risk. . . An official Body of Knowledge creates an orthodoxy that we do not
have today. If the orthodox practices are not well founded in science, as so much
of SWEBOK seems not to be, the evolution of the field from weak (but orthodox)
practices to better ones will be the subject of lawsuit after lawsuit. And the results of
those suits will be determined by non-engineers who don’t understand the practices.”
Further criticisms of SWEBOK by Dr. Kaner can be found at his web site
http://www.kaner.com.
Although it is clear that SWEBOK does not have the general consensus its spon-
sors have sought, it should also be noted that there are also a great number of proponents
of the SWEBOK Guide. Also, ACM’s current record related to body of knowledge
issues for software engineering is not as clear cut as it might seem, as they are work-
ing on a body of education knowledge project which has had some relation to SWE-
BOK (Section 2.3) and are supporting certification exams in software engineering,
which by definition require a body of knowledge (Section 3.1). The fact also re-
mains that despite the opposition of the ACM Council and others, software engineers
are currently being licensed or undergoing national certification in several countries
(or parts thereof), that the qualifications for such licensing or comprehensive certifica-
tion must be assessed against a body of knowledge, and that SWEBOK is the most
prominent body of knowledge artifact that currently exists.
school year, there were six accredited programs, and two others being reviewed for
accreditation. Accreditation in that country is done by the Canadian Engineering Ac-
creditation Board (CEAB), which is part of the Canadian Council of Professional
Engineers, the entity that licenses professional engineers there.
The first undergraduate software engineering program in the United States was
started at the Rochester Institute of Technology in the fall of 1996. In the late 1990s,
ABET approved criteria for accrediting software engineering under its Engineering
Accreditation Commission. The first undergraduate software engineering programs
were considered in the 2002–03 accreditation cycle; at least four schools have pub-
licly stated that they were visited by ABET in the fall of 2002.
It is of interest to look at the ABET/EAC software engineering criteria in more
detail, since (as will be seen) it does not have the close relationship to licensing or
national certification that the other countries mentioned here do. The ABET/EAC
criteria [1] contain eight general criteria, of which Criterion 4 (Professional Com-
ponent) and Criterion 8 (Program Criteria) specifically address requirements for specific
curriculum content.
Criterion 4 states that “The professional component must include: (a) one year
of a combination of college level mathematics and basic sciences (some with ex-
perimental experience) appropriate to the discipline; (b) one and one-half years of
engineering topics, consisting of engineering sciences and engineering design ap-
propriate to the student’s field of study. . .” Note that this means that the continuous
mathematical subjects (e.g., calculus and differential equations) taken by most engi-
neering disciplines do not necessarily need to be taken by software engineers. Also,
since ABET allows computer science courses to be used as engineering sciences,
software engineering majors are not required to take traditional engineering sciences
such as statics and thermodynamics. (However, as will be seen later, the licensing
examination for graduating engineers in the U.S. still does require continuous math-
ematics and traditional engineering sciences.)
Criterion 8 specifies criteria for each individual engineering discipline. The cur-
riculum section of the software engineering criteria states that “The curriculum must
provide both breadth and depth across the range of engineering and computer sci-
ence topics implied by the title and objectives of the program. The program must
demonstrate that graduates have: the ability to analyze, design, verify, validate, im-
plement, apply, and maintain software systems; the ability to appropriately apply
discrete mathematics, probability and statistics, and relevant topics in computer sci-
ence and supporting disciplines to complex software systems; and the ability to work
in one or more significant application domains.” Note that the “application domain”
section of the criteria addresses one of the concerns expressed by Notkin, Gorlick
and Shaw in their report to ACM concerning SWEBOK.
The lead “society” for accrediting software engineering programs within ABET
is CSAB (http://www.csab.org), a joint ACM/IEEE-CS organization. ACM and the
Computer Society are also collaborating on the development of a curriculum model;
the Computing Curricula-Software Engineering (CCSE) project (which was for-
merly the aforementioned SWEEP) is intended to provide detailed undergraduate
software engineering curriculum guidelines which could serve as a model for higher
education institutions across the world. The first major component of this project
was the development of Software Engineering Education Knowledge (SEEK) [35],
a collection of topics considered important in the education of software engineering
students. SEEK was created and reviewed by volunteers in the software engineering
education community. The SEEK body is a three-level hierarchy, initially divided
into knowledge areas (KAs). Those KAs are then further divided into units, and fi-
nally, those units are divided into topics.
Each topic in SEEK is also categorized for its importance: Essential, Desired, or
Optional. There are currently over 200 essential topics under a North Amer-
ican educational model. Essential topics are also annotated with indicators from
Bloom’s Taxonomy in the Cognitive Domain [12] to show the level of mastery ex-
pected. SEEK only uses three of the six Bloom Taxonomy values: knowledge, com-
prehension, and application.
SEEK is important relative to accreditation in that if it is adopted by the ma-
jor computing societies, then an argument can be made that an accredited software
engineering program should be following its guidelines, in the same way that the
SWEBOK Guide might be used in relation to licensing. A worldwide survey of bac-
calaureate software engineering programs [11] revealed that many of them are using
SEEK—even though it is still only in draft form—as an instrument to determine if
the proper core software engineering knowledge is being addressed in their respec-
tive curricula.
In fact, SEEK and SWEBOK share a number of similarities, including a great
deal of overlap in their respective knowledge areas, although they have different
requirements and target audiences. These similarities led the developers of the
two projects to hold a workshop to suggest improvements to both artifacts at the 2002
Software Technology and Engineering Practice (STEP) conference in Montreal. The
STEP post-conference proceedings contained three papers developed as an outcome
of the workshop, including a preliminary mapping of SWEBOK to SEEK [15]. It is
also interesting to note that ACM (a co-sponsor of SEEK) is still playing an indirect
role in the development of SWEBOK.
There is little controversy today over the concept of software engineering as an
academic discipline, although there still are some individuals who believe that
having such degrees only at the graduate level is essential for providing the proper
depth in computer science and other related disciplines. (A panel discussing the rel-
properly in a particular situation. For instance, if a doctor treats a patient, and the
patient dies, but it is determined that the physician followed established medical
practices in that particular case, then the doctor is not legally liable for any actions taken,
despite the patient’s death. Proponents of the SWEBOK Guide claim that it contains
such generally-agreed upon practices for software engineering, while others such as
Notkin, Shaw, and Kaner disagree with that assertion.
The liability issue is one that has caused great concern to some segments of the
computing community. Kaner writes that
“For several years, computer malpractice has been a losing lawsuit because to be
sued for malpractice (professional negligence), you must be (or claim to be) a
member of a profession. Software development and software testing are not pro-
fessions as this term is usually used in malpractice law. Therefore, malpractice
suits against programmers and testers fail. . .
“So why does it matter whether malpractice is a viable type of lawsuit? Mal-
practice suits are more serious than suits for breach of contract or simple
negligence. . . Licensing [of software engineers] will lead to one thing: malprac-
tice liability. If a state government declares us a profession and starts licensing
us, that state’s courts will accept us as professionals, and that means they will
allow lawsuits for computer malpractice.” [28]
If Kaner’s view is accurate, then the legal shielding provided by licensing might well
be outweighed by the damage to licensed software engineers that would be caused
by the increased liability of now being open to malpractice suits. An ACM
Task Force on safety-critical software chaired by John Knight of the University of
Virginia and Nancy Leveson of MIT (with Kaner as one of its other four members)
addressed additional concerns related to the malpractice issue, stating:
“The process of determining that an act constitutes malpractice is only partially
driven by engineers. In a typical lawsuit for malpractice, an injured or economi-
cally harmed person sues the engineer, often as part of a broader lawsuit. The per-
son will bring evidence that the engineer acted in ways, or made decisions, that
were not up to the professional standards of software engineering. This evidence
will be evaluated by lawyers, liability insurance company staff and lawyers, ju-
rors, and judges. None of these people are engineers. If juries find that certain
conduct is negligent, malpractice insurers will probably advise their insureds
(engineers) against engaging in that conduct and may well provide incentives, in
the form of lower insurance rates, for engineers who adopt recommended prac-
tices or who do not engage in counter-recommended practices. Over time, the
determination of what constitute good engineering practices may be driven more
by the courts and insurance companies than by engineers.” [30]
requires all individuals using the term “engineer” to be licensed, then the issue be-
comes whether to have a process by which someone can gain the legal right to use
the term “software engineer” when advertising for services, or allow no one to do so.
This was the issue in Texas at the time licensing of software engineers began there.
However, if a particular jurisdiction such as California is only required to license par-
ticular specified engineering disciplines, then the choice is whether to allow anyone
to use the term “software engineer”, or to regulate its use.
In either case, the question for a legal jurisdiction is: should it specify criteria by
which some people can call themselves a software engineer, and some not, and if so,
what those criteria will be.
Some ACM Council members then provided their viewpoint in the February 2000
issue of Communications of the ACM; however, it is interesting that of the four re-
sponses that appeared in the “Forum” section of the May 2000 issue, three took
issue with the ACM Council’s decision, while the fourth supported it, but anony-
mously [7].
Subsequent to their decision on licensing, the ACM Council commissioned two
other reports: the aforementioned one on the body of knowledge, as well as a re-
port focusing on the potential licensing of software engineering working on safety
critical systems (the aforementioned ACM Task Force chaired by Knight and Leve-
son). The latter report [30] recommends that “No attempt should be made to license
software engineers engaged in the development of safety-critical software using the
existing [United States] PE mechanism”, that body of knowledge efforts should not
be pursued, and that instead educational efforts should be increased.
Finally, as a result of these events, the ACM Council passed on 30 June 2000 (the
last day of that Council’s term) a motion to withdraw from SWECC, because, in
the Council’s opinion “SWECC has become so closely identified with licensing of
software engineers under a professional engineer model” [3]. The July 2000 issue of
Forum for Advancing Software engineering Education (FASE) contains several arti-
cles related to the ACM reports and their withdrawal from SWECC, with comments
by several noted computing professionals, including Dennis Frailey, who was one of
the ACM representatives on SWECC. Frailey opposed ACM’s actions in withdraw-
ing from SWECC, stating that:
The work of over 400 volunteers from about 50 countries has been cast in doubt
on the basis of weak and often incorrect rationale. . . I am also disappointed at
the exclusionary process by which this decision was reached. The Council and
its task forces chose not to consider the views and insights of ACM’s appointed
SWEcc representatives and project leads—the people actually doing the work—
even after we offered to provide information and to assist in the discussions and
deliberations. The draft rationale for this decision contains incorrect assumptions
and factual errors that we could easily have corrected and that may well have
influenced the Council’s decision [22].
The September 2000 and September 2001 issues of FASE each had a series of follow-
up articles on the withdrawal; the issues are available through the FASE website
at http://www.cs.ttu.edu/fase. All of the ACM-related documents can be found at
http://www.acm.org/serving/se_policy. Communications of the ACM had a section
in its November 2002 issue with several articles on the licensing issue.
The joint ACM/IEEE-CS software engineering curriculum effort under SWECC
went forward under the new CCSE acronym, but only after a year’s delay. As has
been discussed, the Computer Society went forward with the SWEBOK Guide as
scheduled following ACM’s withdrawal from the project.
2.6.1 Canada
In Canada, most engineers must be licensed in order to practice engineering.
This licensing of professional engineers is done at the provincial and territorial
level. The Canadian Council of Professional Engineers, according to its website at
http://www.ccpe.ca, is “the national organization of the 12 provincial and territorial
[bodies] that regulate the practice of engineering in Canada and license the country’s
more than 160,000 professional engineers.”
The first Canadian provincial licensing body to provide guidelines for software engineers
was Professional Engineers Ontario (PEO), in 1999. (Previously, software practi-
tioners had been assessed by PEO on an individual basis to see if they qualified for
a Professional Engineer (PEng) license.) The CCPE has since adopted guidelines
which can be used by each of the twelve licensing bodies within the Council.
The Guideline on Admission to the Practice of Engineering in Canada specifies
the PEng admissions requirements which apply to all of the seventeen engineering
disciplines currently licensed by CCPE. In order to be licensed, applicants must:
• Be academically qualified;
• Have obtained sufficient acceptable engineering work experience in their area
of practice;
• Have an understanding of local practices and conditions;
Council of Examiners for Engineering and Surveying (NCEES), which creates engi-
neering licensing exams for most of the United States. (The NCEES board consists
of representatives of the various state licensing boards.) The general path to licensure
as a professional engineer (PE) is as follows:
• Obtain a degree from an ABET/EAC accredited program,
• Pass the NCEES Fundamentals of Engineering (FE) examination given to peo-
ple who are about to graduate or have recently graduated with a bachelor’s de-
gree,
• Work a minimum of four years under the supervision of a Professional Engineer,
• Pass the NCEES Principles and Practices of Engineering (P&P) exam most re-
lated to the engineering work being done by the applicant,
• Obtain professional references, and
• Pass an examination of the code of ethics and professional conduct for that
particular state.
The FE exam is divided into morning and afternoon sessions. The morning part of the
FE exam is the same for all applicants and includes mathematics (calculus, differen-
tial equations, linear algebra), lab sciences (chemistry and physics), and engineering
sciences (e.g., statics, materials, and thermodynamics). Note that although the
implication is that a student graduating from an accredited engineering program
in the United States should have the educational background sufficient for passing
the FE morning section, the ABET criteria and the morning section content
may be significantly different—and in the case of software engineering, they are. It
is not surprising then that unlike the other countries discussed here, there are two
completely independent entities doing the accreditation and the licensing.
The FE afternoon session can be discipline-specific (for a limited number of dis-
ciplines), or the applicant can take a general examination which in many ways is an
extension of the morning session.
NCEES will not offer P&P or FE afternoon examinations for engineering disci-
plines that do not have at least one accredited program in that area, thus there can
be no action until ABET accredits the first software engineering programs (expected
in July 2003). At that time, at least ten state licensing boards would have to request
that NCEES develop an exam in order for such a project to be considered. However,
for at least the time being, all aspects of the NCEES licensing exams are somewhat
divergent from software engineering curriculum content.
Texas is the only state that currently licenses software engineers as PEs. It has
been doing so since 1998. Because there are no software engineering licensing ex-
ams, the Texas Board of Professional Engineers has been using its waiver clause
(which is available for all engineering disciplines licensed by the board) to license
2.7.2 Australia
The certification of professional engineers in Australia is done through IEAust
(http://www.ieaust.org.au), which (as with the British Computer Society) also ac-
credits degree programs. IEAust also has several levels of membership, of which
Chartered Professional Engineer (CPEng) is both the highest possible level and the
one which is analogous to the professional engineer designation in other countries.
However, IEAust does not have the same relationship with its government as BCS
does with the British government, since the Australian government supports self-
regulation of professionals through the professional societies.
As with the BCS, an applicant for a CPEng must first be a member of IEAust
at another grade level [27]. For instance, a Member IEAust (MIEAust) must have
a degree from an accredited engineering program, have a minimum of three years
of professional experience, and have the support of a member at an equivalent or
higher grade. An applicant for a CPEng must also submit background information
in the form of an Engineering Practice Report (EPR), which must be verified as
satisfactory by IEAust. Once this is done, the applicant will be invited to a one-hour
professional interview conducted by CPEng members from the applicant’s chosen
engineering discipline. This interview will be a peer review of the competencies that
the applicant has claimed in the EPR, and will also test knowledge of the Institution's Code of
Ethics. Those individuals receiving the CPEng designation can request to be placed
on the National Professional Engineers Register (NPER).
Currently, software engineers are not allowed to receive the CPEng designation
or be placed on the NPER. In 2001, the Australian Computer Society (ACS) and
IEAust formed a Joint Software Engineering Board, and developed a discussion pa-
per on the topic of software engineering as a professional discipline. ACS is currently
petitioning IEAust to allow software engineering to be added to the list of approved
engineering disciplines; [27] lists “Information, Telecommunications and Electron-
ics Engineering” (whose description includes software engineering) as a general area
of engineering practice, but also notes that registration on the NPER under this engi-
neering area is not yet available.
2.7.3 Ireland
The Institution of Engineers of Ireland certifies its Chartered Engineers in a man-
ner very similar to that of IEAust. However, unlike Australia, IEI has already started
chartering software engineers during the last few years. Unlike its counterparts in
Australia and the UK, IEI does classify software engineers separately.
3.1.1 ASQ
Founded in 1946, the American Society for Quality (ASQ) provides training and
certification in a variety of quality areas, including one specifically for software pro-
fessionals, the Certified Software Quality Engineer (CSQE). This program, accord-
ing to a page on the ASQ website http://www.asq.org/cert/types/index.html, is “De-
signed for those who have a comprehensive understanding of software quality devel-
opment and implementation; have a thorough understanding of software inspection,
testing, verification, and validation; and can implement software development and
maintenance processes and methods.”
In order to take the CSQE exam, an applicant must have at least eight years of ex-
perience, some of which can be waived due to the person’s educational background.
The CSQE is one of two examples shown here (the other being the Microsoft
Certified Systems Engineer) where the term “engineer” is granted on a certification,
but its use in some jurisdictions (such as certain states or provinces in North Amer-
ica) might be illegal. For those intending to practice in such jurisdictions, Kaner suggests
contacting ASQ to see if an alternate name for the certification can be provided [28].
3.1.2 CSDP
The IEEE Computer Society has developed a competency recognition program
for software professionals [38]. During the development, the program was known
as the “Certified Software Engineering Professional Program”, but due to the same
legal issues mentioned above regarding the use of the term “engineer”, the name was
ultimately changed to Certified Software Development Professional (CSDP). Despite
the name change, it is still software engineering knowledge being tested (although
SWEBOK was not used for the test specifications).
The overall certification program includes requirements on education, professional
experience, passing an examination, and continuing education. To be eligible to take
the exam, a candidate must have, at minimum, a bachelor's degree and 9000 hours of
software development experience.
3.1.3 ICCP
The Institute for Certification of Computing Professionals (ICCP) is located in
Des Plaines, Illinois, USA, and is under the direction of seven constituent societies,
including ACM. ICCP has certified over 55,000 people as Certified Computer Pro-
fessionals over its 30 years of existence. ICCP offers an exam in a number of
software-related areas, including one in software engineering which includes the fol-
lowing topics:
• Computer System Engineering
• Software Project Planning
• Software Requirements
• Software Design
• Programming Languages and Coding
• Software Quality Assurance
• Software Testing Techniques
• Software Maintenance and Configuration Management
It is interesting to note the similarities between the knowledge areas of the SWE-
BOK Guide and the above topics, since ACM has publicly stated its concerns
about defining a body of knowledge sufficient for licensing examina-
tions, and yet has continued to offer a certification exam on similar material.
As with the ASQ Certified Software Quality Engineer exam, an applicant must
have work experience, the amount of which can be reduced according to the person’s
educational background.
sion of it to the medical model, where doctors must be licensed in order to practice
medicine, and can obtain board certification in specialization areas such as internal
medicine or dermatology. For software engineering, such specialization might in-
clude specific job functions (e.g., software project manager) or specific application
domains such as real-time systems. (The application domain specialization area is
one that is not currently addressed by either the knowledge areas of the SWEBOK
Guide or in current licensing and certification efforts.) Frailey also notes that medical
technician certifications such as for the use of X-Ray machines are roughly equiva-
lent to most of the company-based certifications.
This model for licensing is also interesting because it could include most of the li-
censing and certification efforts that currently exist. For example, the Software Qual-
ity Engineer certification could be a specialization area obtained by a licensed soft-
ware engineer. It may be that through the use of such models, which are different
from those used in most engineering disciplines, the unique field of software engineer-
ing can find a method of licensing and certification that is of the most benefit to
both the software professional community and the public-at-large.
ACKNOWLEDGEMENTS
The author appreciates the invaluable research help provided by the following
individuals: Jocelyn Armarego, David Budgen, Robert Cochran, Ana Moreno, Fred
Otto, Michael Ryan, J. Barrie Thompson and Alan Underwood.
REFERENCES
[6] Bagert D.J., “SIGCSE survey, TCEA discussion on company-based certification”, Forum
for Advancing Software engineering Education (FASE) 10 (3) (March 2000) (electronic
newsletter), http://www.cs.ttu.edu/fase/v10n03.txt.
[7] Bagert D.J., “Communications of the ACM Forum on Licensing Issue”, Forum for Ad-
vancing Software engineering Education (FASE) 10 (5) (May 2000) (electronic newslet-
ter), http://www.cs.ttu.edu/fase/v10n05.txt.
[8] Bagert D.J., Mead N.R., “Software engineering as a professional discipline”, Computer
Science Education 11 (1) (March 2001) 73–87.
[9] Bagert D.J., “Education and training in software engineering”, in: Encyclopedia of Soft-
ware Engineering, second ed., John Wiley and Sons, 2002, pp. 452–465.
[10] Bagert D.J., “Texas licensing of software engineers: All’s quiet—for now”, Communi-
cations of the ACM 45 (11) (November 2002) 92–94.
[11] Bagert D.J., Ardis M.A., “Software engineering baccalaureate programs in the United
States: An overview”, in: Proceedings of Frontiers in Education Conference, Boulder,
Colorado, USA, 5–8 November 2003, submitted for publication.
[12] Bloom B.J., et al. (Eds.), Taxonomy of Educational Objectives: Handbook I: Cognitive
Domain, first ed., David McKay Co., New York, NY, 1956.
[13] Bourque P., Dupuis R., Abran A., Moore J.W., Tripp L., “The guide to the software
engineering body of knowledge”, IEEE Software 16 (6) (November/December 1999)
35–44.
[14] Bourque P., Dupuis R. (Eds.), SWEBOK: A Guide to the Software Engineering Body of
Knowledge (Trial Version 1.00), IEEE Computer Society, Los Alamitos, CA, USA, May
2001.
[15] Bourque P., Robert F., Lavoie J.-M., Lee A., Trudel S., Lethbridge T.C., “Guide to
the software engineering body of knowledge (SWEBOK) and the software engineering
education knowledge (SEEK)—A preliminary mapping”, in: Proceedings of the 10th
International Workshop on Software Technology and Engineering Practice, Montreal,
Canada, 6–8 October 2002, pp. 8–23.
[16] British Computer Society and The Institution of Electrical Engineers, A Report on Un-
dergraduate Curricula for Software Engineering, Institution of Electrical Engineers,
1989.
[17] Canadian Council of Professional Engineers, Guideline on Admission to the Practice of
Engineering in Canada, 2001.
[18] Canadian Council of Professional Engineers, “Canadian Council of Professional Engi-
neers and Microsoft Corp. agree on use of ‘Engineer’ title”, news release, 11 May 2001;
Reprinted in Forum for Advancing Software engineering Education (FASE) 11 (6) (June
2001) (electronic newsletter), http://www.cs.ttu.edu/fase/v11n06.txt.
[19] DeMarco T., “Certification or Decertification?”, Communications of the ACM 42 (7)
(July 1999) 10 (letter to the editor).
[20] Douglas P., Cocchi T., “Report on analyses of pilot software engineer survey data”, Joint
Steering Committee of IEEE Computer Society/ACM for Establishment of Software
Engineering as a Profession, http://www.computer.org/tab/seprof/survey.htm, 27 March
1997.
[21] Ford G., Gibbs N., A Mature Profession of Software Engineering, Technical Re-
port CMU/SEI-96-TR-004, Software Engineering Institute, Carnegie Mellon University,
Pittsburgh, PA, 1996.
[22] Frailey D.J., “Statement regarding ACM’s withdrawal from SWEcc”, Forum for Advanc-
ing Software engineering Education (FASE) 10 (7) (July 2000) (electronic newsletter),
http://www.cs.ttu.edu/fase/v10n07.txt.
[23] Frailey D.J., “Licensing and certification of software engineering personnel”, in: Ency-
clopedia of Software Engineering, second ed., John Wiley and Sons, 2002, pp. 452–465.
[24] Frailey D.J., “Software engineering grows up”, IEEE Software 16 (6) (Novem-
ber/December 1999), pp. 66, 68.
[25] Gotterbarn D., Miller K., Rogerson S., “Software engineering Code of Ethics is ap-
proved”, Communications of the ACM 42 (10) (October 1999) 102–107.
[26] “Illinois Compiled Statutes, Professions and Occupations, Professional Engineering
Practice Act of 1989 (amended), amendment effective 1 January 2002, Chapter 225,
Statute 325, Section 9”, http://www.legis.state.il.us/legislation/ilcs/ch225/ch225act325.
htm.
[27] Institution of Engineers, Australia, Chartered Professional Engineers, May 2002.
[28] Kaner C., “Computer Malpractice”, http://www.kaner.com/malprac.htm, 2000. This is
an updated version of a paper with the same title that appeared in Software QA 3 (4)
(1996) 23.
[29] Kaner C., “Software engineering as a profession after the withdrawal: One year later”,
Forum for Advancing Software engineering Education (FASE) 11 (9) (September 2001)
(electronic newsletter), http://www.cs.ttu.edu/fase/v11n09.txt.
[30] Knight J., Leveson N., DeWalt M., Elliot L., Kaner C., Nissenbaum H., On Licensing
of Software Engineers Working on Safety-Critical Software, Association for Computing
Machinery, August 2001, http://www.acm.org/serving/se_policy/safety_critical.pdf.
[31] National Council of Examiners for Engineering and Surveying, Model Law, Revised
August 2002.
[32] Notkin D., Gorlick M., Shaw M., An Assessment of Software Engineering Body of
Knowledge Efforts, Association for Computing Machinery, New York, May 2000,
http://www.acm.org/serving/se_policy/bok_assessment.pdf.
[33] Parnas D.L., “Licensing of software engineers in Canada”, Communications of the
ACM 45 (11) (November 2002) 94–96.
[34] Pressman R.S., Software Engineering: A Practitioner’s Approach, fifth ed., McGraw-
Hill, Boston, MA, 2001.
[35] Sobel A.E.K. (Ed.), Second Draft of the Software Engineering Education Knowledge, 6
December 2002, http://sites.computer.org/ccse/know/SecondDraft.pdf.
[36] Sommerville I., Software Engineering, sixth ed., Addison–Wesley, Wokingham, Eng-
land, 2000.
[37] Thompson J.B., Edwards H.M., “Software engineering in the UK 2001”, Forum for Ad-
vancing Software engineering Education (FASE) 11 (11) (November 2001) (electronic
newsletter), http://www.cs.ttu.edu/fase/v11n11.txt.
[38] Tockey S., “IEEE Computer Society develops competency recognition program”, Fo-
rum for Advancing Software engineering Education (FASE) 11 (9) (September 2001)
(electronic newsletter), http://www.cs.ttu.edu/fase/v11n09.txt.
[39] U.S. Department of Labor, “The 2000–10 job outlook in brief”, Occupational Outlook
Quarterly 46 (1) (Spring 2002) 9–43.
Cognitive Hacking
GEORGE CYBENKO
Thayer School of Engineering
Dartmouth College
8000 Cummings Hall Hanover, NH 03755-8000
USA
george.cybenko@dartmouth.edu
ANNARITA GIANI
Institute for Security Technology Studies
Thayer School of Engineering
Dartmouth College
8000 Cummings Hall Hanover, NH 03755-8000
USA
annarita.giani@dartmouth.edu
PAUL THOMPSON
Institute for Security Technology Studies
Thayer School of Engineering
Dartmouth College
8000 Cummings Hall Hanover, NH 03755-8000
USA
paul.thompson@dartmouth.edu
Abstract
In this chapter, we define and propose countermeasures for a category of com-
puter security exploits which we call “cognitive hacking.” Cognitive hacking
refers to a computer or information system attack that relies on changing human
users’ perceptions and corresponding behaviors in order to be successful. This
is in contrast to denial of service (DOS) and other kinds of well-known attacks
that operate solely within the computer and network infrastructure. Examples are
given of several cognitive hacking techniques, and a taxonomy for these types of
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.1. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2. Perception Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.3. Computer Security Taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.4. Semantic Attacks and Information Warfare . . . . . . . . . . . . . . . . . . . . 41
1.5. Deception Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.6. Cognitive Hacking and Intelligence and Security Informatics . . . . . . . . . . 42
2. Examples of Cognitive Hacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1. Internet Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2. Insider Threat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3. Economic and Digital Government Issues Related to Cognitive Hacking . . . . . . . 53
3.1. An Information Theoretic Model of Cognitive Hacking . . . . . . . . . . . . . 53
3.2. Theories of the Firm and Cognitive Hacking . . . . . . . . . . . . . . . . . . . 56
3.3. Digital Government and Cognitive Hacking . . . . . . . . . . . . . . . . . . . . 56
4. Legal Issues Related to Cognitive Hacking . . . . . . . . . . . . . . . . . . . . . . . 57
5. Cognitive Hacking Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1. Single Source Cognitive Hacking . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2. Multiple Source Cognitive Hacking . . . . . . . . . . . . . . . . . . . . . . . . 63
6. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7. Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
1. Introduction
1.1 Background
Computer and network security presents great challenges to our evolving informa-
tion society and economy. The variety and complexity of cybersecurity attacks that
have been developed parallel the variety and complexity of the information tech-
nologies that have been deployed, with no end in sight for either. In this chapter,
we distinguish three classes of information systems attacks: physical, syntactic, and
cognitive. Physical and syntactic attacks can be considered together as autonomous
attacks.
Autonomous attacks operate totally within the fabric of the computing and net-
working infrastructures. For example, the well-known Unicode attack against older,
unpatched versions of Microsoft’s Internet Information Server (IIS) can lead to
root/administrator access. Once such access is obtained, any number of undesired
activities by the attacker is possible. For example, files containing private informa-
tion such as credit card numbers can be downloaded and used by an attacker. Such
an attack does not require any intervention by users of the attacked system; hence,
we call it an “autonomous” attack.
By contrast, a cognitive attack requires some change in users’ behavior, effected
by manipulating their perceptions of reality. The attack’s desired outcome cannot be
achieved unless human users change their behaviors in some way. Users’ modified
actions are a critical link in a cognitive attack’s sequencing. To illustrate what we
mean by a cognitive attack, consider the following news report [62]:
“Friday morning, just as the trading day began, a shocking company press release
from Emulex (Nasdaq: EMLX) hit the media waves. The release claimed that
Emulex was suffering the corporate version of a nuclear holocaust. It stated that
the most recent quarter’s earnings would be revised from a $0.25 per share gain to
a $0.15 loss in order to comply with Generally Accepted Accounting Principles
(GAAP), and that net earnings from 1998 and 1999 would also be revised. It also
said Emulex’s CEO, Paul Folino, had resigned and that the company was under
investigation by the Securities and Exchange Commission.
Trouble is, none of it was true.
The real trouble was that Emulex shares plummeted from their Thursday close of
$113 per share to $43—a rapid 61% haircut that took more than $2.5 billion off
of the company’s hide—before the shares were halted an hour later. The damage
had been done: More than 3 million shares had traded hands at the artificially
low rates. Emulex vociferously refuted the authenticity of the press release, and
by the end of the day the company’s shares closed within a few percentage points
of where they had opened.”
Mark Jakob, 23 years old, fraudulently posted the bogus release on Internet Wire,
a Los Angeles press-release distribution firm. The release was picked up by several
business news services and widely redistributed without independent verifica-
tion. The speed, scale and subtlety with which networked information propagates
have created a new challenge for society, outside the domain of classical computer
security which has traditionally been concerned with ensuring that all use of a com-
puter and network system is authorized.
The use of information to affect the behavior of humans is not new. Language, or
more generally communication, is used by one person to influence another. Propa-
ganda has long been used by governments, or by other groups, particularly in time
of war, to influence populations [19,30,34]. Although the message conveyed by pro-
Over the last 20 years, earlier attempts in computer security research to develop
formal security models designed into the computer system were abandoned as vari-
ous models were shown to be undecidable [29]. As more and more elaborate security
approaches were developed, they were rendered out of date by rapid changes in the
computing environment, in particular, the development of the World Wide Web. In
recent years dramatic and costly computer viruses, denial of service attacks, and
concerns with the security of e-commerce have drawn the attention of researchers
in computer security. While these security breaches are a serious concern, cognitive
hacking is a form of attack that deserves to receive more attention.
Cognitive hacking is defined here as gaining access to, or breaking into, a com-
puter information system for the purpose of modifying certain behaviors of a human
user in a way that violates the integrity of the overall user-information system. The
integrity of such a system would, for example, include correctness or validity of the
information the user gets from such a system. In this context, the integrity of a com-
puter system can be defined more broadly than is implicit in Landwehr's definition of computer security. Smith [87] refers to breaches in computer security as violations of the semantics of the computer system, i.e., the intended operation of the system. Wing argues for a similar view [95]. In this sense the World Wide Web itself
can be seen as a computer system used for communication, e-commerce, and so on.
As such, activities conducted over the Web that violate the norms of communication
or commerce, for example, fraud and propaganda, are considered to be instances
of cognitive hacking, even if they do not involve illegitimate access to, or breaking
into, a computer. For example, a person might maintain a website that presents mis-
information with the intent of influencing viewers of the information to engage in
fraudulent commercial transactions with the owner of the website.
Some examples of cognitive hacking, such as the manipulation of a Yahoo news
item [68], are instances of Landwehr’s second threat, the unauthorized modification
of information [55], but there are many other examples that do not fit Landwehr’s
taxonomy. This is not surprising, because, as suggested by Landwehr, new applica-
tions, i.e., in this case Web services, will have new security needs, which must be
understood.
Two broad classes of cognitive hacking can be distinguished: overt and covert.
With overt cognitive hacking no attempt is made to conceal the fact that a cognitive
hack has occurred. For example, website defacement is a type of overt cognitive
hacking. While a website defacer may hope that the defacement is not noticed for as
long as possible by a Web page administrator, the Web defacer’s modification to the
website is a blatant modification that the intended audience will realize immediately
is not the unmodified website.
It has been estimated that 90% of attacks on Web pages are total page hacks,
where the attacker replaces the entire page of the attacked site [45]. This overt cognitive hacking, while much more prevalent than the covert forms discussed in this
chapter, is more of a nuisance to website administrators and an embarrassment to
website owners. Covert cognitive hacking, by contrast, is likely to have more signif-
icant consequences, because it can influence a user’s perceptions and behavior.
Misinformation is often a covert form of cognitive hacking. Misinformation is an
intentional distribution or insertion of false or misleading information intended to
influence readers' decisions and/or activities. The open nature of the Internet makes
it an ideal arena for the dissemination of misinformation [17].
The distinction between overt and covert cognitive hacking is not necessarily un-
ambiguous. For example, the recent hacking of the Yahoo news story [68] was clearly
not as blatantly overt as a typical Web defacement. On the other hand, the changes
introduced were obvious enough that the suspicions of a careful reader would soon
be aroused. It is possible, though, to imagine subtler modifications that would have
gone undetected by any reader not already familiar with a more reliable account of
the news item.
TABLE I
HACKING WITH THE GOAL OF MODIFYING USER BEHAVIOR
Denial of Service
Theft of Services 8 8, 15
Theft of Information 4
Fraud Financial 1, 2, 3, 4, 5
Fraud non-Financial 6, 7
Political 10, 11, 14, 15, 17 17
Commercial or Private 6, 9 6
Perception Management
Self-aggrandizement 12, 13, 15
White Hat Hack 13, 16
ranging from $0.05 to $0.17 per share. They opened many Internet message board
accounts using a computer at the UCLA BioMedical Library and posted more than
500 messages on hot websites to pump up the company's stock, stating false information with the purpose of convincing others to buy its shares. They claimed that the company was being taken over and that the
target price per share was between 5 and 10 dollars. Using other accounts they also
pretended to be an imaginary third party, a wireless telecommunications company,
interested in acquiring NEI Webworld. What the three men did not post was the fact
that NEI was bankrupt and had liquidated assets in May 1999. The stock price rose
from $0.13 to $15 in less than one day, and they realized about $364,000 in prof-
its. The men were accused of selling their shares incrementally, setting target prices
along the way as the stock rose. On one day the stock opened at $8 and soared to $15
5/16 a share by 9:45 a.m. ET and by 10:14 a.m. ET, when the men no longer had any
shares, the stock was worth a mere 25 cents a share.
On Wednesday, December 15, 1999, the US Securities and Exchange Commission
(SEC) and the United States Attorney for the Central District of California charged
the three men with manipulating the price of NEI Webworld, Inc. In late January
2001, two of them agreed to give up their illegal trading profits (approximately $211,000). The Commission also filed a new action naming a fourth individual as
participating in the NEI Webworld and other Internet manipulations. Two of the men
were sentenced on January 22, 2001 to 15 months incarceration and 10 months in a
community corrections center. In addition to the incarcerations, Judge Feess ordered
the men to pay restitution of between $566,000 and $724,000. The judge was to
hold a hearing on February 26 to set a specific figure [86]. Anyone with access to a computer can use as many screen names as desired to post false information about a particular company in an effort to pump up its stock price, so that they can dump their own shares while giving the impression that their actions have been above board.
He sent this kind of message after having bought a block of stocks. The purpose was to influence people to pump up the price by recommending the stock. The messages looked credible and people did not even think to investigate
the source of the messages before making decisions about their money. Jonathan
gained $800,000 in six months. Initially the SEC forced him to give up everything,
but he fought the ruling and was able to keep part of what he gained. The question
is whether he did something wrong, in which case the SEC should have kept every-
thing. The fact that the SEC allowed Jonathan to keep a certain amount of money
shows that it is not clear whether or not the teenager is guilty from a legal per-
spective. Certainly, he made people believe that the same message was posted by 200
different people.
Richard Walker, the SEC’s director of enforcement, referring to similar cases,
stated that on the Internet there is no clearly defined border between reliable and
unreliable information, and that investors must exercise extreme caution when they receive
investment pitches online.
2.1.4 PayPal.com
“We regret to inform you that your username and password have been lost in
our database. To help resolve this matter, we request that you supply your login
information at the following website.”
Many customers of PayPal received this kind of e-mail and subsequently gave per-
sonal information about their PayPal account to the site linked by the message
(http://paypalsecure.com, not http://www.paypal.com) [52]. The alleged perpetrators
apparently used their access to PayPal accounts in order to purchase items on eBay.
Dow Jones, Bloomberg, and CBS Marketwatch picked up the hoax. Due to this false information, in a few hours Emulex Corporation lost over $2 billion. After sending misinformation about the company, Jakob executed trades so that he earned $236,000. Jakob was arrested and charged with disseminating a false press release and with securities fraud. He is subject to a maximum of 25 years in prison, a maxi-
mum fine of $220 million, two times investor losses, and an order of restitution up to
$110 million to the victims of his action.
The insider threat is much more pervasive, however, than a small number of high-profile national security cases. It has been estimated that the majority of all computer security breaches are due to insider attacks, rather than to external hacking [4].
As organizations move to more and more automated information processing en-
vironments, it becomes potentially possible to detect signs of insider misuse much
earlier than has previously been possible. Information systems can be instrumented
to record all uses of the system, down to the monitoring of individual keystrokes and
mouse movements. Commercial organizations have made use of such clickstream
mining, as well as analysis of transactions to build profiles of individual users. Credit
card companies build models of individuals’ purchase patterns to detect fraudulent
usage. Companies such as Amazon.com analyze purchase behavior of individual
users to make recommendations for the purchase of additional products likely to
match the individual user’s profile.
A technologically adept insider, however, may be aware of countermeasures de-
ployed against him, or her, and operate in such a way as to neutralize the counter-
measures. In other words, an insider can engage in cognitive hacking against the
network and system administrators. A similar situation arises with Web search en-
gines, where what has been referred to as a cold war exists between Web search
engines and search engine optimizers, i.e., marketers who manipulate Web search
engine rankings on behalf of their clients.
Models of insiders can be built based on:
(a) known past examples of insider misuse;
(b) the insider’s work role in the organization;
(c) the insider’s transactions with the information system; and
(d) the content of the insider’s work product.
This approach to the analysis of the behavior of the insider is analogous to that sug-
gested for analyzing the behavior of software programs by Munson and Wimer [70].
One aspect of this approach is to look for known signatures of insider misuse, or for
anomalies in each of the behavioral models individually. Another aspect is to look
for discrepancies among the models. For example, if an insider is disguising the true
intent of his or her transactions by making deceptive transactions, then this cognitive hacking might be uncov-
ered by comparing the transactions to the other models described above, e.g., to the
insider’s work product.
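To make the idea of cross-model discrepancy checking concrete, the following is a minimal Python sketch. The model names, the normalized anomaly scores, and the alerting threshold are all illustrative assumptions and are not taken from the approaches cited above.

    from statistics import mean

    def discrepancy_score(model_scores):
        """Return how far the most extreme model deviates from the average score.

        model_scores maps a behavioral model name to a normalized anomaly
        score in [0, 1]; the models correspond loosely to items (a)-(d) above.
        """
        avg = mean(model_scores.values())
        return max(abs(s - avg) for s in model_scores.values())

    scores = {
        "past_misuse_signatures": 0.05,  # no known signature matched
        "work_role":              0.10,  # activity consistent with job role
        "transactions":           0.15,  # transaction log looks routine
        "work_product":           0.85,  # content of work product is anomalous
    }

    if discrepancy_score(scores) > 0.5:  # threshold chosen arbitrarily
        print("Models disagree - possible deceptive transactions; review manually")

An insider who sanitizes one behavioral channel (for example, the transaction log) but not another (the work product) would stand out under such a comparison, which is the intuition behind looking for discrepancies among the models rather than anomalies in any single one.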
User models have long been of interest to researchers in artificial intelligence and
in information retrieval [81,26,49]. Several on-going research programs have been
actively involved in user modeling for information retrieval. The Language Model-
ing approach to probabilistic information retrieval has begun to consider query (user)
models [53,54]. The Haystack project at MIT is building models of users based on
their interactions with a document retrieval system and the user’s collections of doc-
uments. The current focus of this project, however, seems to be more on overall
system architecture issues, rather than on user modeling as such [46].
The current type of user modeling that might provide the best basis for cognitive
hacking countermeasures is recommender system technology [92,93,44]. One of the
themes of the recommender systems workshop held at the 1999 SIGIR conference
[43] was the concern to make recommender systems applicable to problems of more
importance than selling products. Since then, recommender systems technology has
developed, but applications are still largely commercial. Researchers are
concerned with developing techniques that work well with sparse amounts of data
[31] and with scaling up to searching tens of millions of potential neighbors, as
opposed to the tens of thousands of today’s commercial systems [84]. Related to this
type of user modeling, Anderson and Khattak [5] described preliminary results with
the use of an information retrieval system to query an indexed audit trail database,
but this work was never completed [3].
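As a rough illustration of the Anderson and Khattak idea of applying information retrieval to an indexed audit trail, the following toy sketch ranks audit records against a free-text query. The audit lines, the query, and the plain term-overlap scoring are assumptions made for illustration; their system used a proper retrieval engine over real audit data.

    # Toy sketch: treat audit-trail records as documents and rank them
    # against a free-text query using simple term overlap.
    audit_trail = [
        "user alice copied payroll.xls to usb device",
        "user bob logged in from workstation 12",
        "user alice emailed payroll.xls to external address",
    ]

    def rank(query, docs):
        terms = set(query.lower().split())
        scored = [(len(terms & set(d.lower().split())), d) for d in docs]
        return [d for score, d in sorted(scored, reverse=True) if score > 0]

    for hit in rank("payroll.xls external", audit_trail):
        print(hit)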
where $b_i$ is the percentage of the available wealth invested on horse i at each race.
So the betting strategy that maximizes the total wealth gained is obtained by solving
the following optimization problem:

$$W(p, o) = \max_b W(b, p, o) = \max_b \sum_{i=1}^{m} p_i \log b_i o_i,$$

subject to the constraint that the $b_i$'s add up to 1. It can be shown that this solution
turns out to be simply $b = p$ (proportional betting), and so
$W(p, o) = \sum_{i=1}^{m} p_i \log p_i o_i$.
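As a concreteness check on the formula above, the following minimal Python sketch evaluates the doubling rate for a two-horse race and compares proportional betting with an arbitrary alternative split. The probabilities and odds are made-up values; it is assumed from the surrounding discussion that p denotes the true win probabilities and o the odds paid, with logarithms taken to base 2 as is conventional for doubling rates.

    import math

    def doubling_rate(b, p, o):
        """W(b, p, o) = sum_i p_i * log2(b_i * o_i)."""
        return sum(pi * math.log2(bi * oi) for bi, pi, oi in zip(b, p, o))

    p = [0.7, 0.3]   # assumed true win probabilities for two horses
    o = [1.5, 4.0]   # assumed odds (payout per unit bet)

    w_proportional = doubling_rate(p, p, o)            # b = p
    w_alternative  = doubling_rate([0.5, 0.5], p, o)   # an arbitrary even split

    print(f"W with proportional betting: {w_proportional:.4f}")
    print(f"W with an even split:        {w_alternative:.4f}")

The proportional allocation never does worse than the alternative, consistent with b = p being the maximizer.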
Thus, a hacker can predict the strategy of a systematic gambler and make an attack with the goal of deluding the gambler about his/her future gains. For example, a hacker might lure an indecisive gambler to invest money on false prospects. In this case it would be useful to understand how sensitive the function W is to p and o and to tamper with the data in order to convince a gambler that it is worth playing (because W appears illusorily larger than it actually is).
To study the sensitivity of W to its domain variables we consider the partial deriv-
atives of W with respect to pi and oi and see where they assume the highest values.
This gives us information on how steep the function W is on subsets of its domain.
If we consider the special case of races involving only two horses (m = 2), then
we have
FIG. 2.
Thus, if we fix one of the variables then we can conduct a graphic analysis of those
functions with a 3D plot, see Fig. 2.
CASE 1. $o_1$ is constant. This is the doubling rate function. The most sensitive parameter for increasing W is $o_2$. As this variable increases, W grows at a fast rate for low values of p and at a smaller rate for higher values of p.
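A small numerical experiment, sketched below, reproduces the behavior described in Case 1. The two-horse form W(p1, o1, o2) = p1 log2(p1 o1) + (1 - p1) log2((1 - p1) o2) is the assumed specialization of the doubling rate above under proportional betting, and the particular values of o1, o2, and p1 are arbitrary.

    import math

    def W(p1, o1, o2):
        """Two-horse doubling rate under proportional betting."""
        p2 = 1.0 - p1
        return p1 * math.log2(p1 * o1) + p2 * math.log2(p2 * o2)

    o1, o2, eps = 2.0, 3.0, 1e-6
    for p1 in (0.1, 0.5, 0.9):
        dW_do2 = (W(p1, o1, o2 + eps) - W(p1, o1, o2)) / eps  # finite difference
        print(f"p1 = {p1:.1f}: dW/do2 is approximately {dW_do2:.3f}")

The estimated slope with respect to o2 shrinks as p1 grows, which matches the statement that W is most sensitive to o2 when p is small.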
A digital government workshop held in 2003 [72] focused on five scenarios for
future authentication policies with respect to digital identity:
• Adoption of a single national identifier;
• Sets of attributes;
• Business as usual, i.e., continuing growth of the use of ad hoc identifiers;
• Ubiquitous anonymity;
• Ubiquitous identity theft.
An attorney from Buchanan Ingersoll P.C. [11] provides this additional perspec-
tive:
Content providers, the businesses which hire them, as well as their advertising
agencies and Web site developers, may be liable directly to victims for false or
misleading advertising, copyright and trademark infringement, and possibly for
vicarious infringements. The difference between other forms of media and the
Internet is that the Internet contemplates the publishing and linking of one site
to another, which geometrically multiplies the possibility for exposure to claims.
The electronic benefits of the Internet promotes the linking of information con-
tained in one site to another and creates a higher likelihood of dissemination of
offending material.
The Lanham Act, 15 U.S.C. §1125(a) [65] has been applied to the prosecution of
false advertising on the Web. It provides that any person who “. . . uses in commerce
any word, term, name, symbol, or device. . . false description of origin, false or mis-
leading description of fact. . . which, (A) is likely to cause confusion, or to cause
mistake, or to deceive as to affiliation, connection, or association of such person with
another person, or as to the origin, sponsorship, or approval of his or her goods, ser-
vices, or commercial activities by another person, or (B) in commercial advertising
or promotion misrepresents the nature, characteristics, qualities, or geographic ori-
gin of his or her or another person’s goods, services, or commercial activities, shall
be liable in a civil action by any person who believes that he or she is likely to be
damaged by such an act.”
The Lanham Act, copyright, and trademark law, among other established areas of
the law, are being used to decide cases related to cognitive hacking. For example, in
the area of search engine optimization, if company A’s website uses the trademark of
company B in metatags for company A’s website in order to divert Web searches for
company B to company A’s website, then company A could be liable for trademark
infringement or false advertising under the Lanham Act. On the other hand, company
A could argue that its use of metatags was protected under trademark principles of
fair use or First Amendment principles of free speech. As a more extreme action,
company A might download the entire content of a popular website from company
B and incorporate it into company A’s website with all of the company B content
printed in white on a white background, so that it would be invisible to human viewers,
but visible to Web search engines. This would be a violation of copyright laws and
possibly also be considered unfair competition and trademark dilution [1].
The application of the law to cognitive hacking and other areas related to the Inter-
net is a very volatile area of the law. The events of September 2001 have only made
the situation more volatile as the debate between privacy and security has shifted. It
is to be expected that more legislation affecting this area will be enacted and that the
associated case law will continue to evolve over the coming years.
agreement could be the basis of legal actions for fraud or misrepresentation. Dissemi-
nating false information that damages the reputation of a person, business, or product
could lead to a libel, defamation, or commercial disparagement suit. Incorporating
trademarks of others as metatags that mislead consumers about the origin of goods
or services or reduce the goodwill associated with a mark could be reached through
the legal remedies provided by trademark and trademark antidilution statutes.
The special persuasive powers of computer output create an extra dimension of
legal concern. Humans are quick to believe what they read and see on their computer
screen. Even today, it is common to hear someone say a fact must be true because
they read it on the Web. A website’s anthropomorphic software agent is likely to
enjoy greater credibility than a human, yet no conscience will prevent an anthropo-
morphic agent from saying whatever it has been programmed to say [42]. Cognitive
hackers may, therefore, require new legal doctrines because their mechanisms appar-
ently bypass normal human critical thinking.
Still more elusive will be identifying and taking meaningful legal action against
the perpetrator. The speed and lack of human intervention that is typically associated
with cognitive hacking, combined with the lack of definitive identification informa-
tion generally inherent in the Internet’s present architecture, complicate legal proof
of who is the correct culprit. Privacy protection makes the task more difficult. Even
if identified, the individual or entity may disappear or lack the assets to pay fines or
damages.
Attention may, therefore, focus instead on third-party intermediaries, such as In-
ternet service providers, websites, search engines, and so forth, just as it has for
Internet libel, copyright infringement, and pornography. Intermediaries are likely to
have greater visibility and more assets, making legal action easier and more produc-
tive. A cognitive hacking victim might contend that an intermediary or the producer
of Web-related software failed to take reasonable measures to defend against cog-
nitive hacking. An intermediary’s legal responsibility will grow as the technological
means for blocking cognitive hacking become more effective and affordable. Rapid
technological advances with respect to anti-hacking tools would raise the bar for what is considered reasonable care.
The actual application of the law to cognitive hacking is still in formation. It is to
be expected that case law with respect to cognitive hacking will continue to evolve
over the coming years. Enactment of specific legislation is also possible.
access to information assets (such as in Web defacements) in the first place or detect-
ing posted misinformation before user behavior is affected (that is, before behavior
is changed but possibly after the misinformation has been disseminated). The lat-
ter may not involve unauthorized access to information, as, for instance, in “pump
and dump” schemes that use newsgroups and chat rooms. By definition, detecting a
successful cognitive hack would involve detecting that the user behavior has already
been changed. We are not considering detection in that sense at this time.
Our discussion of methods for preventing cognitive hacking will be restricted to
approaches that could automatically alert users of problems with their information
source or sources (information on a Web page, newsgroup, chat room, and so on).
Techniques for preventing unauthorized access to information assets fall under the
general category of computer and network security and will not be considered here.
Similarly, detecting that users have already modified their behaviors as a result of the
misinformation, namely, that a cognitive hack has been successful, can be reduced to
detecting misinformation and correlating it with user behavior.
The cognitive hacking countermeasures discussed here will be primarily mathe-
matical and linguistic in nature. The use of linguistic techniques in computer se-
curity has been pioneered by Raskin and colleagues at Purdue University’s Center
for Education and Research in Information Assurance and Security [6]. Their work,
however, has not addressed cognitive hacking countermeasures.
be the first to give reliable news about breaking stories that impact the business
environment. Such pressures are at odds with the time consuming process of veri-
fying accuracy. A compromise between the need to quickly disseminate information
and the need to investigate its accuracy is not easy to achieve in general.
Automated software tools could in principle help people make decisions about the
veracity of information they obtain from multiple networked information systems.
A discussion of such tools, which could operate at high speeds compared with human
analysis, follows.
his or her gains, but could pay a heavy price, if this quick action is taken based on
misinformation.
A cognitive hacking countermeasure is under development which will allow an
end user to effectively retrieve and analyze documents from the Web that are sim-
ilar to the original news item. First, a set of documents is retrieved by the Google
News clustering algorithm. The Google News ranking of the clustered documents is
generic, not necessarily optimized as a countermeasure for cognitive attacks. We are
developing a combination process in which several different search engines are used
to provide alternative rankings of the documents initially retrieved by Google News.
The ranked lists from each of these search engines, along with the original rank-
ing from Google News, will be combined using the Combination of Expert Opinion
algorithm [64] to provide an improved ranking. Relevance feedback judgments
from the end user will be used to train the constituent search engines. It is expected
that this combination and training process will yield a better ranking than the initial
Google News ranking. This is an important feature in a countermeasure for cognitive
hacking, because a victim of cognitive hacking will want to detect misinformation as
soon as possible in real time.
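Since the Combination of Expert Opinion algorithm [64] itself is not reproduced here, the sketch below uses simple reciprocal-rank fusion as a stand-in to show how rankings from several engines over the same candidate documents might be merged. The document identifiers and the second and third engines are hypothetical.

    from collections import defaultdict

    def fuse_rankings(rankings, k=60):
        """Merge several ranked lists of document ids via reciprocal-rank fusion."""
        scores = defaultdict(float)
        for ranked in rankings:
            for position, doc in enumerate(ranked, start=1):
                scores[doc] += 1.0 / (k + position)
        return sorted(scores, key=scores.get, reverse=True)

    google_news = ["doc_a", "doc_b", "doc_c", "doc_d"]   # initial retrieval
    engine_2    = ["doc_c", "doc_a", "doc_d", "doc_b"]   # alternative ranking
    engine_3    = ["doc_a", "doc_c", "doc_b"]            # alternative ranking

    print(fuse_rankings([google_news, engine_2, engine_3]))

Relevance feedback could then be incorporated by weighting each engine's contribution according to how well its past rankings agreed with the user's judgments.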
This problem models a group of generals plotting a coup. Some generals are reli-
able and intend to go through with the conspiracy while others are feigning support
and in fact will support the incumbent ruler when the action starts. The problem is to
determine which generals are reliable and which are not.
Just as with the Ulam game model for a single information source, this model
assumes a sequence of interactions according to a protocol, something that is not
presently applicable to the cognitive hacking examples we have considered, although
this model is clearly relevant to the more sophisticated information sources that
might arise in the future.
6. Future Work
This chapter has defined a new concept in computer network security, cognitive
hacking. Cognitive hacking is related to other concepts, such as semantic hacking, in-
formation warfare, and persuasive technologies, but is unique in its focus on attacks
via a computer network against the mind of a user. Psychology and Communica-
tions researchers have investigated the closely related area of deception and detec-
tion in interpersonal communication, but have not yet begun to develop automated
countermeasures. We have argued that cognitive hacking is one of the main features
which distinguishes intelligence and security informatics from traditional scientific,
medical, or legal informatics. If, as claimed by psychologists studying interpersonal
ACKNOWLEDGEMENTS
Support for this research was provided by a Department of Defense Critical In-
frastructure Protection Fellowship grant with the Air Force Office of Scientific Re-
search, F49620-01-1-0272; Defense Advanced Research Projects Agency Projects
F30602-00-2-0585 and F30602-98-2-0107; and the Office of Justice Programs, Na-
tional Institute of Justice, Department of Justice Award 2000-DT-CX-K001 (S-1).
The views in this document are those of the authors and do not necessarily represent
the official position of the sponsoring agencies or of the US Government.
R EFERENCES
[1] Abel S., “Trademark issues in cyberspace: The brave new frontier”, http://library.lp.
findlaw.com/scripts/getfile.pl?file=/firms/fenwick/fw000023.html, 1998.
[2] Agre P., “The market logic of information”, Knowledge, Technology, and Policy 13 (1)
(2001) 67–77.
[3] Anderson R., Personal communication, 2002.
[4] Anderson R.H., Bozek T., Longstaff T., Meitzler W., Skroch M., Van Wyk K., “Re-
search on mitigating the insider threat to information systems – #2”, in: Proceedings of
a Workshop Held August 2000. RAND Technical Report CF163, RAND, Santa Monica,
CA, 2000.
[5] Anderson R., Khattak A., “The use of information retrieval techniques for intrusion
detection”, in: First International Workshop on Recent Advances in Intrusion Detection
(RAID), Louvain-la-Neuve, Belgium, 1998.
[6] Atallah M.J., McDonough C.J., Raskin V., Nirenburg S., “Natural language processing
for information assurance and security: An overview and implementations”, in: Pro-
ceedings of the 2000 Workshop on New Security Paradigms, 2001.
[7] BBC News Online, “Hamas hit by porn attack”, http://news.bbc.co.uk/low/english/
world/middle_east/newsid_1207000/1207551.stm, 2001.
[8] BBC News Online, “Sharon’s website hacked”, http://news.bbc.co.uk/low/english/
world/middle_east/newsid_1146000/1146436.stm, 2001.
[9] Biber D., Dimensions of Register Variation: A Cross-Linguistic Comparison, Cam-
bridge Univ. Press, Cambridge, UK, 1995.
[10] Biber D., “Spoken and written textual dimensions in English: Resolving the contradic-
tory findings”, Language 62 (2) (1986) 384–413.
[11] Buchanan Ingersoll P.C., “Avoiding web site liability—online and on the hook?”,
http://library.lp.findlaw.com/scripts/getfile.pl?file=/articles/bipc/bipc000056.html,
2001.
[12] Buller D.B., Burgoon J.K., “Interpersonal deception theory”, Communication The-
ory 6 (3) (1996) 203–242.
[13] Burgoon J.K., Blair J.P., Qin T., Nunamaker J.F., “Detecting deception through linguis-
tic analysis”, in: NSF/NIJ Symposium on Intelligence and Security Informatics, June 1–
3, 2003, Tucson, AZ, in: Lecture Notes in Computer Science, Springer-Verlag, Berlin,
2003, pp. 91–101.
[14] Cao J., Crews J.M., Lin M., Burgoon J.K., Nunamaker J.F., “Designing Agent99
trainer: a learner-centered, Web-based training system for deception detection”, in:
NSF/NIJ Symposium on Intelligence and Security Informatics, June 1–3, 2003, Tucson,
AZ, in: Lecture Notes in Computer Science, Springer-Verlag, Berlin, 2003, pp. 358–
365.
[15] Chandy K.M., Misra J., Parallel Program Design: A Foundation, Addison–Wesley,
Reading, MA, 1988.
[16] Chen H., Zeng D.D., Schroeder J., Miranda R., Demchak C., Madhusudan T. (Eds.),
Intelligence and Security Informatics: First NSF/NIJ Symposium ISI 2003, Tucson, AZ,
June 2003, Proceedings, Springer-Verlag, Berlin, 2003.
[17] Chez.com, “Disinformation on the Internet”, http://www.chez.com/loran/art_danger/
art_danger_on_internet.htm, 1997.
[18] Cignoli R.L.O., D’Ottaviano I.M.L., Mundici D., Algebraic Foundations of Many-
Valued Reasoning, Kluwer Academic, Boston, 1999.
[19] Combs J.E., Nimmo D., The New Propaganda: The Dictatorship of Palaver in Con-
temporary Politics, Longman, New York, 1993.
[20] Cooper W.S., Maron M.E., “Foundations of probabilistic and utility-theoretic index-
ing”, Journal of the Association for Computing Machinery 25 (1) (1978) 67–80.
[21] Cornetto K.M., “Identity and illusion on the Internet: Interpersonal deception and de-
tection in interactive Internet environments”, PhD thesis, University of Texas at Austin,
2001.
[22] Cover T.A., Thomas J.A., Elements of Information Theory, Wiley, New York, 1991.
[23] Cybenko G., Giani A., Thompson P., “Cognitive hacking and the value of information”,
in: Workshop on Economics and Information Security, May 16–17, Berkeley, CA, 2002.
[24] Cybenko G., Giani A., Thompson P., “Cognitive hacking: A battle for the mind”, IEEE
Computer 35 (8) (2002) 50–56.
[25] Cybenko G., Giani A., Heckman C., Thompson P., “Cognitive hacking: Technological
and legal issues”, in: LawTech 2002, November 7–9, 2002.
[26] Daniels P.J., Brooks H.M., Belkin N.J., “Using problem structures for driving human–
computer dialogues”, in: Sparck Jones K., Willett P. (Eds.), Readings in Informa-
tion Retrieval, Morgan Kaufmann, San Francisco, 1997, pp. 135–142. Reprinted from
RIAO-85 Actes: Recherche d’Informations Assistee par Ordinateur, France IMAG,
Grenoble, pp. 645–660.
[27] Dellarocas C., “Building trust on-line: The design of reliable reputation reporting mech-
anisms for online trading communities”, Center for eBusiness@MIT paper 101, 2001.
[28] Denning D., Information Warfare and Security, Addison–Wesley, Reading, MA, 1999.
[29] Denning D., “The limits of formal security models”, National Computer Systems Secu-
rity Award Acceptance Speech, 1999.
[30] Doob L., Propaganda, Its Psychology and Technique, Holt, New York, 1935.
[31] Drineas P., Kerendis I., Raghavan P., Competitive recommendation systems STOC’02,
May 19–21, 2002.
[32] eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058 (ND Cal., 2000).
[33] “Re-engineering in real time”, Economist (31 January, 2002), http://www.economist.
com/surveys/PrinterFriendly.cfm?Story_ID=949093.
[34] Ellul J., Propaganda, Knopf, New York, 1966, translated from French by Kellen K.,
Lerner L.
[35] Farahat A., Nunberg G., Chen F., “AuGEAS (Authoritativeness Grading, Estimation,
and Sorting)”, in: Proceedings of the International Conference on Knowledge Manage-
ment CIKM’02, 4–9 November, McLean, VA, 2002.
[36] Fawcett T., Provost F., in: Kloesgen W., Zytkow J. (Eds.), Handbook of Data Mining
and Knowledge Discovery, Oxford Univ. Press, 2002.
[37] Felten E.W., Balfanz D., Dean D., Wallach D., “Web spoofing: An Internet con game”,
Technical Report 54–96 (revised), Department of Computer Science, Princeton Univer-
sity, 1997.
[38] George J., Biros D.P., Burgoon J.K., Nunamaker Jr. J.F., “Training professionals to
detect deception”, in: NSF/NIJ Symposium on Intelligence and Security Informatics,
June 1–3, 2003, Tucson, AZ, in: Lecture Notes in Computer Science, Springer-Verlag,
Berlin, 2003, pp. 366–370.
[39] Gertz v. Robert Welch, Inc., 428 US 323, 94 S.Ct. 2997, 41 L.Ed.2d 789 (1974).
[40] “Google News beta”, http://news.google.com/.
[41] “The Hacktivist. Fluffi Bunni hacker declares Jihad”, http://thehacktivist.com/article.
php?sid=40, 2001.
[42] Heckman C.J., Wobbrock J., “Put your best face forward: Anthropomorphic agents,
e-commerce consumers, and the law”, in: Fourth International Conference on Au-
tonomous Agents, June 3–7, Barcelona, Spain, 2000.
[43] Herlocker J. (Ed.), Recommender Systems: Papers and notes from the 2001 workshop,
In conjunction with the ACM SIGIR Conference on Research and Development in In-
formation Retrieval, New Orleans, 2001.
[44] Hofmann T., “What people (don’t) want”, in: European Conference on Machine Learn-
ing (ECML), 2001.
[45] Hunt A., Web Defacement Analysis, ISTS, 2001.
[46] Huynh D., Karger D., Quan D., “Haystack: A platform for creating, organizing and
visualizing information using RDF”, in: Intelligent User Interfaces (IUI), 2003.
[47] “Information Warfare Site”, http://www.iwar.org.uk/psyops/index.htm, 2001.
[48] Interpersonal Deception: Theory and Critique, Communication Theory 6 (3) (1996),
special issue.
[49] Johansson P., “User modeling in dialog systems”, St. Anna Report SAR 02-2, 2002.
[50] Karlgren J., Cutting D., Recognizing text genres with simple metrics using discriminant
analysis, 1994.
[51] Kessler B., Nunberg G., Schütze H., “Automatic detection of genre”, in: Proceedings
of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics
and Eighth Conference of the European Chapter of the Association for Computational
Linguistics, 1997.
[52] Krebs B., “E-mail Scam Sought to defraud PayPal customers”, Newsbytes (19 Decem-
ber, 2001), http://www.newsbytes.com/news/01/173120.html.
[53] Lafferty J., Chengxiang Z., “Document language models, query models, and risk mini-
mization for information retrieval”, in: 2001 ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR), 2001.
[54] Lafferty J., Chengxiang Z., “Probabilistic relevance models based on document and
query generation”, in: Proceedings of the Workshop on Language Modeling and Infor-
mation Retrieval, Carnegie Mellon University, 2001, Kluwer volume PT reviewing, in
press.
[55] Landwehr C.E., “A security model for military message systems”, ACM Transactions
on Computer Systems 9 (3) (1984).
[56] Landwehr C.E., “Formal models of computer security”, Computing Surveys 13 (3)
(1981).
[57] “Latimes.com., ‘Hacktivists’, caught in Web of hate, deface Afghan sites”, http://www.
latimes.com/technology/la-000077258sep27.story?coll=la%2Dheadlines%2Dtechnology,
2001.
[58] Lewis M., “Jonathan Lebed: Stock manipulator, S.E.C. Nemesis—and 15”, New York
Times Magazine (25 February, 2001).
[59] Lewis M., Next: The Future Just Happened, Norton, New York, 2001, pp. 35–36.
[60] Libicki M., “The mesh and the Net: Speculations on armed conflict in an age of
free silicon”, National Defense University, McNair Paper 28, http://www.ndu.edu/ndu/
inss/macnair/mcnair28/m028cont.html, 1994.
[61] Lynch C., “When documents deceive: Trust and provenance as new factors for infor-
mation retrieval in a tangled Web”, Journal of the American Society for Information
Science & Technology 52 (1) (2001) 12–17.
[62] Mann B., “Emulex fraud hurts all”, in: The Motley Fool, 2000, http://www.fool.com/
news/foolplate/2000/foolplate000828.htm.
[63] Maron M.E., Kuhns J.L., “On relevance, probabilistic indexing and information re-
trieval”, Journal of the ACM 7 (3) (1960) 216–244.
[64] Mateescu G., Sosonkina M., Thompson P., “A new model for probabilistic informa-
tion retrieval on the Web”, in: Second SIAM International Conference on Data Mining
(SDM 2002). Workshop on Web Analytic, 2002.
[65] “Matthew Bender and Company, Title 15. Commerce and Trade. Chapter 22. Trade-
marks general provisions. United States Code Service”, http://web.lexis-nexis.com/
congcomp/document?_m=46a301efb7693acc36c35058bee8e97d&_docnum=1&wchp=
dGLStS-lSlAA&_md5=5929f8114e1a7b40bbe0a7a7ca9d7dea, 2001.
[66] Mensik M., Fresen G., “Vulnerabilities of the Internet: An introduction to the ba-
sic legal issues that impact your organization”, http://library.lp.findlaw.com/scripts/
getfile.pl?file=/firms/bm/bm000007.html, 1996.
[67] Mosteller F., Wallace D.L., Inference and Disputed Authorship: The Federalist,
Addison–Wesley, Reading, MA, 1964.
[68] MSNBC, “Hacker alters news stories on Yahoo”, http://stacks.msnbc.com/news/
631231.asp, 2001.
[69] Mundici D., Trombetta A., “Optimal comparison strategies in Ulam’s searching game
with two errors”, Theoretical Computer Science 182 (1–2) (1997) (15 August).
[70] Munson J.C., Wimer S., “Watcher: The missing piece of the security puzzle”, in: 17th
Annual Computer Security Applications Conference (ACSAC’01), December 10–14,
New Orleans, LA, 2001.
[71] “National Center for Digital Government: Integrating Information and Government
John F. Kennedy School of Government Harvard University”, http://www.ksg.harvard.
edu/digitalcenter/.
[72] “National Center for Digital Government: Integrating Information and Government
“Identity: The Digital Government Civic Scenario Workshop” Cambridge, MA,
April 28–29, 2003, John F. Kennedy School of Government Harvard University”,
http://www.ksg.harvard.edu/digitalcenter/conference/.
[73] NetworkWorldFusion, “Clever fake of WTO web site harvests e-mail addresses”,
http://www.nwfusion.com/news/2001/1031wto.htm, 2001.
[74] New York v. Vinolas, 667 N.Y.S.2d 198 (N.Y. Crim. Ct. 1997).
[75] “Newsbytes. Pop singer’s death a hoax a top story at CNN”, http://www.newsbytes.
com/cgi-bin/udt/im.display.printable?client.id=newsbytes&story.id=170973, 2001.
[76] Pratkanis A.R., Aronson E., Age of Propaganda: The Everyday Use and Abuse of Per-
suasion, Freeman, New York, 1992.
[77] Rao J.R., Rohatgi P., “Can pseudonymity really guarantee privacy?”, in: Proceedings of
the 9th USENIX Security Symposium, Denver, CO, August 14–17, 2000.
[78] R.A.V. v. City of St. Paul, 505 U.S. 377, 112 S.Ct. 2538, 120 L.Ed.2d 305, 1992.
[79] “The Register. Intel hacker talks to The Reg”, http://www.theregister.co.uk/content/
archive/17000.html, 2001.
[80] “The Register. New York Times web site smoked”, http://www.theregister.co.uk/
content/6/16964.html, 2001.
[81] Rich E., “Users are individuals: Individualizing user models”, International Journal of
Man–Machine Studies 18 (3) (1983) 199–214.
[82] Van Rijsbergen C.J., Information Retrieval, second ed., Butterworths, London, 1979.
[83] Salton G., McGill M., Introduction to Modern Information Retrieval, McGraw–Hill,
New York, 1983.
[84] Sarwar B., Karypis G., Konstan J., Reidl J., “Item-based collaborative filtering recom-
mendation algorithms”, in: WWW10, Hong Kong, May 1–5, 2001.
[85] Schneier B., “Semantic attacks: The third wave of network attacks”, Crypto-gram
Newsletter (October 15, 2000), http://www.counterpane.com/crypto-gram-0010.html.
[86] Smith A.K., “Trading in false tips exacts a price”, U.S. News & World Report (February
5, 2001), p. 40.
WARREN HARRISON
Portland State University and
Hillsboro Police Department
High Tech Crime Team
Portland, OR 97207-0751
USA
warren@cs.pdx.edu
Abstract
The use of computers to either directly or indirectly store evidence by criminals
has become more prevalent as society has become increasingly computerized. It
is now routine to find calendars, e-mails, financial account information, detailed
plans of crimes, and other artifacts that can be used as evidence in a criminal
case stored on a computer’s hard drive. Computer forensics is rapidly becoming
an essential part of the investigative process, at both the local and federal law enforcement levels. It is estimated that half of all federal criminal cases require
a computer forensics examination. This chapter will address the identification,
extraction, and presentation of evidence from electronic media as it is typically
performed within law enforcement agencies, describe the current state of the
practice, as well as discuss opportunities for new technologies.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.1. Computers and Crime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2. Digital Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.1. Differences From Traditional Evidentiary Sources . . . . . . . . . . . . . . . . 79
2.2. Constraints on Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3. The Forensics Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.1. The Identification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.2. The Preparation Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
1 The law regarding searching digital devices is complex. This chapter provides an overview of U.S. Fed-
eral rules for searching digital devices but is not intended to provide legal advice. The reader is urged to
seek competent legal counsel for specific questions.
1. Introduction
In September 1998, a 53-year old English physician was arrested for the murders
of at least 15 of his elderly patients by administering fatal doses of diamorphine, an
opiate sometimes used to relieve pain. He would come to be suspected of intention-
ally killing an additional 200 patients over a 23 year period. The case would turn into
the largest serial murder case in UK history.
When investigators searched the offices of the doctor’s practice, they found a net-
work of computers running a commercial medical record management system. This
system contained the medical records of each of his 3100 patients. Included in each
record was a manually entered date and transcription of written notes for every con-
tact the doctor had with each patient. The records of the patients he was suspected of
murdering each indicated a lengthy period of declining health, ultimately culminat-
ing in their death.
Unbeknownst to the doctor, an upgrade to the medical record management system
in October 1996 added an audit trail function. The audit trail recorded every entry
made as well as the date it was entered based on the computer’s system clock. Upon
forensic analysis of the computer, investigators found that the record for one patient
dated June 23 1997 indicated that she was a chronic morphine abuser. However,
upon examining the audit trail file, investigators found that the June 27, 1997 entry
was actually entered on June 25, 1998. Not only was the date of the entry falsified,
but the patient’s body was discovered on June 24, 1998, the day before the entry was
actually entered into the computer. Similar sorts of entries existed for other patients.
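The detection step in this case amounts to comparing the date written into each clinical note with the date the audit trail says the note was entered. A minimal sketch of that comparison follows; the record data are illustrative, and the actual medical record software is not being reproduced.

    from datetime import date

    # (patient, date claimed in the clinical note, date recorded by the audit trail)
    records = [
        ("patient_1", date(1997, 6, 27), date(1998, 6, 25)),
        ("patient_2", date(1998, 3, 2),  date(1998, 3, 2)),
    ]

    for patient, claimed, entered in records:
        if entered > claimed:
            lag = (entered - claimed).days
            print(f"{patient}: note dated {claimed} was entered {lag} days later")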
Based partially on the computer evidence as well as other facts and pieces of
evidence the physician was convicted. He was sentenced to 15 life terms.
2. Digital Evidence
Digital forensics is concerned with obtaining “relevant evidence” from an elec-
tronic medium. “Relevant evidence” is simply any evidence that makes the existence
of a fact that is of consequence to the case either more or less probable than it would
be without the evidence.2 This can be as simple as an innocent e-mail between two
friends, or as sinister as a set of plans detailing the steps to be carried out to perform
a murder.
For instance, assume the defense of a suspect charged with forgery is based on
the premise that the forged documents found on his computer were not his but rather
belonged to his roommate. The probability of this fact is affected by whether or not
the dates and times associated with the last access and/or last modification of those
documents occurred at a time when it could be proven the suspect was somewhere
other than sitting in front of the computer. If the file access times coincided with dates
and times that the suspect was at work, the probability is high that the documents
were not his. Therefore, the dates and times of access and modification for a given
file may very well be “relevant evidence.” This also illustrates a key property of
electronic evidence: it has the potential for being both inculpatory (i.e., showing the
suspect is guilty) as well as exculpatory (i.e., showing the suspect is innocent).
Obviously a large number of different sorts of electronic artifacts and “meta-
artifacts” may serve as relevant evidence. As we have seen, an alibi may be sub-
stantiated by time stamped computer logs that put the accused somewhere other than
the crime scene when the offense was committed. Likewise, a series of e-mails may
indicate a relationship between a victim and a suspect. In fact, just some of the arti-
facts in which the digital investigator may be interested include [3]:
2 Strictly speaking an artifact does not become evidence unless its ability to prove a fact has been estab-
lished. Until then, it is “potential evidence.” However, for purposes of convenience, this chapter shall use
the term “evidence” to mean “potential evidence” except in cases where the alternate definition is clearly
intended.
• e-mail messages;
• chat-room logs;
• ISP records;
• webpages;
• HTTP server logs;
• databases;
• digital directories;
• cookie files;
• word processing documents;
• graphical images;
• spreadsheets;
• address books;
• calendars;
• meta information about files.
Because these artifacts cannot be examined directly with the naked eye (they are,
after all, simply electrons recorded on some sort of electromagnetic device) they
are what is known as “latent evidence”—that is, evidence that requires equipment, software and/or methods to make it discernible.
Moreover, the digital forensic specialist will perform a very different analysis depending upon the crime. For instance, the approach taken to locate child pornography on a computer is different than that taken to find evidence of identity theft. Techniques and skills will also vary between a Windows desktop computer and a Linux webserver. Some skills and techniques, such as recovering deleted files from an Apple II, will become obsolete in practice after a relatively short time.
Reith, Carr, and Gunsch [6] have described a lifecycle model for conducting a dig-
ital forensics investigation. The development of such a model is useful both for those
involved in a digital forensics effort as well as for providing a taxonomy within which
technology and methods can be organized. The Reith, Carr, and Gunsch process
model is partially based on the U.S. Federal Bureau of Investigation’s Handbook of
Forensics Services crime scene search guidelines [7], and extends from the initial
recognition of potential evidence through the presentation of this evidence in a trial
and its return.
The lifecycle described in this chapter is a slight modification of the one described
by Reith, Carr, and Gunsch, in that our discussion ends at the presentation phase since
the return of evidence has a more administrative rather than technological flavor.
The model described in this chapter also combines the Examination and Analysis
phases since from the view of the technologist, both activities appear to be temporally
intertwined.
The lifecycle model we will be discussing (see Fig. 1) consists of the following
phases:
1. Identification. Investigation of any criminal activity may produce digital ev-
idence. This phase deals with the recognition and identification of potential
digital evidence.
2. Preparation. Once a likely source of digital evidence has been recognized,
planning and various preparatory tasks must be carried out. For example, ac-
quiring permission to search is an extremely important piece of preparatory
work that must be done before digital evidence can be obtained.
3. Strategy. The goal of any effort to collect evidence is to maximize the collection
of untainted evidence while minimizing impact of the collection on the victim.
In this phase, a plan to acquire the evidence must be developed.
4. Preservation. To be useful, the state of physical and digital evidence must be
preserved for collection. This phase deals with securing both the physical area
as well as the contents of the digital device in question.
5. Collection. Ultimately, the digital evidence must be acquired. This is one of the
most critical phases of the entire process since it has the most obvious bearing
on the authenticity of the evidence.
6. Examination and Analysis. Examination involves searching the seized artifacts
for possible evidence, while Analysis involves determining the significance of
the evidence found, usually within the context of a theory of the crime. The two
phases are so intertwined from the technologist’s viewpoint that it is difficult
to discuss them separately. Examination is affected by the Analysis and vice-
versa.
7. Presentation. Ultimately, summarization of the conclusions drawn in the
Analysis phase as well as explanation of the techniques used in the Collec-
tion and Examination activities must be presented to the court. Because jurors
cannot be assumed to have any prior technical understanding of computing or
computers, this can pose a significant challenge.
In spite of increasingly technical demands, only some of the phases in this process
are typically performed by computer specialists. This is especially true for local (i.e.,
city police and county sheriffs) agencies. Often the digital evidence is seized by
personnel from local agencies and then transported to a state or federal agency for
the Examination phase. In this case, it is likely all the phases through Collection are
performed by uniformed officers or detectives with little, if any, special training.
The remainder of this section elaborates upon each of these phases and discusses
important issues pertinent to a law enforcement context. However, these issues are
equally relevant to researchers and technical personnel, since they severely constrain
the application of both current and future technology.
The deciding issue with regard to the Fourth Amendment is whether an individual has a reasonable expectation of privacy. If a reasonable expectation of privacy exists, then either the individual must provide voluntary consent to the search or a search warrant must be obtained to allow a law enforcement officer or their agent to search an item.
A search warrant is a document, issued by a magistrate, giving legal authorization
for a search of a container, place or thing. Usually a search warrant narrowly defines
the purpose of the search and the items for which the police are searching.
Computers and other digital devices are generally treated like a closed container
such as a briefcase or file cabinet [5], and accessing information stored on the device
is similar to opening a closed container. In other words, absent any action on the owner’s part that diminishes his expectation of privacy, law enforcement generally requires a search warrant in order to access the contents of a digital storage device.
However, based on the concept of “plain view” (an officer may search an item that
is in “plain view” without consent or a warrant since it is clear that there was no
expectation of privacy), certain actions by the owner of the computer may indicate
that there was no expectation of privacy. For instance, frequently loaning the computer and user account information to others, or taking the computer to a repair shop, may indicate that the owner did not expect the contents of the computer to remain private.
In general, the courts have ruled that searching files not related to the original war-
rant (e.g., searching for child pornography when the original warrant was issued to
search for evidence of drug sales) exceeds the scope of a warrant.4 This is signifi-
cant from a technological point of view because if every file is considered a separate
closed container, then all examination of the evidence must be related to the original
search warrant.
To illustrate the idea of the scope of a warrant, consider a search warrant that says an officer may search for records substantiating that a computer was used by “X.” Under such a warrant, the search of e-mail may be justifiable to determine if “X” ever used the computer to
send e-mail. On the other hand, opening graphic and audio files would be much more
difficult to justify given the bounds set by this warrant. Likewise, if the computer
were to contain the e-mail accounts of a number of other users, there would be no
authorization to search the e-mail sent or received by users other than “X.”
In order to obtain a search warrant, a sworn statement that explains the basis for
the belief that the search is justified by probable cause (the affidavit) and the pro-
posed warrant that describes the place to be searched and the things to be seized are
submitted to a magistrate. If the magistrate agrees that probable cause has been es-
tablished and the warrant is not unreasonably broad, they will approve the warrant.
4 If, in the course of carrying out a search, the examiner legitimately stumbles across evidence of another crime, it will usually serve as probable cause for issuing a new warrant authorizing a search for evidence of that crime.
Because the warrant must not be unreasonably broad (to prevent “fishing expedi-
tions”), special care must be taken when describing digital information and/or the
hardware that is to be seized.
If the artifacts to be seized relate to information (e.g., lists of drug suppliers or
on-line purchases), it is generally advised that the warrant should describe the infor-
mation rather than the storage devices on which it resides. Unless the computer in
question contains contraband, is an instrumentality or is the fruit of a crime, it is usu-
ally advisable to extract the information and leave the computer if at all possible. This
is especially true if the computer containing the information is used in a legitimate
business or is not owned by the suspect. For instance, it is not at all uncommon for
computers owned by the suspect’s employer or friend to contain potential evidence.
Courts have begun requiring computers seized from third parties and businesses to
be either examined in-place or promptly returned.
The information may be specified in very particular terms (e.g., “the cookies.txt
file”) or may be very broad (e.g., “all evidence of the user visiting a given web site,”
which may include the cookies.txt file, but also the browser cache, and perhaps even
e-mails if use of the site entailed an e-mail confirmation). On the other hand, a request
for permission to search “all files for proof of criminal activity” on the computer
would likely be construed as too broad to obtain a warrant. Further, even if such a
warrant were issued, it is unlikely that evidence collected under the warrant would
be considered admissible by the trial court, or that the warrant would withstand an
appeal.
If the focus of the search is information and not a particular piece of hardware,
specifying information rather than hardware may also allow broader seizure. For
example, seizure of information from all computers at a location the suspect may
have reasonably used to access the given website may be permitted if information
rather than hardware is described. On the other hand, specification of a single desktop
computer may require information from the suspect’s laptop to be left behind.
If the digital device contains contraband, is an instrumentality as in the metham-
phetamine lab example or is the fruit of a crime, the device itself will probably be
seized and the data extracted off-site. If discovery of the computer and/or peripherals
is incidental to a search for a methamphetamine lab and not included in the original
warrant, a new warrant would be required to seize these items.
• Prevent fragile data (such as file modification times) from being erased or modified. This includes isolating computers from phone lines and other network connections to prevent data from being accessed remotely.
• Secure the computer for evidence retrieval, either by leaving the computer
“OFF” if it is not turned on, or if the computer is “ON,” photographing the
screen and then disconnecting all power sources. This is done by unplugging
the computer power cord from the wall and the back of the computer.5 This
advice comes from the concern that a “logic bomb” may exist such that if a cer-
tain shutdown sequence is not followed, all incriminating files are automatically
erased from the computer’s hard drive, much like bookies often record their bets
on flash paper so in the event of a raid, evidence can be obliterated by simply
touching a match to the pages. By simply removing power from the computer, logic bombs that may exist in the computer’s shutdown scripts can be bypassed.
Of course, by turning off a running computer, some fragile, transient data may be lost. For example, the contents of buffers and any mounted virtual network drives will be lost. Consequently, some digital forensics experts advise attaching
a SCSI device or using an open network connection to get the results of various
commands and the contents of various environment variables before turning off
a running computer.
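To make this concrete, the following sketch illustrates what such a volatile-data capture might look like on a UNIX-like system. It is an illustration only: the particular commands, the output file, and indeed the decision to run anything at all on a live evidence machine are assumptions rather than prescribed practice, and any output would be written to the examiner's own media, never to the evidence drive.

# volatile_capture.py -- hypothetical sketch of capturing fragile, transient data
# from a running UNIX-like computer before power is removed. The command list and
# output location are illustrative assumptions, not a recommended procedure.
import datetime
import subprocess

COMMANDS = [
    ["date"],            # system clock, to compare against a trusted time source
    ["ps", "aux"],       # running processes
    ["netstat", "-an"],  # open network connections and listening ports
    ["env"],             # environment variables visible to the capture shell
]

def capture(outfile="volatile_capture.txt"):
    with open(outfile, "w") as out:
        out.write("Capture started (examiner clock): %s\n" % datetime.datetime.now())
        for cmd in COMMANDS:
            out.write("\n==== %s ====\n" % " ".join(cmd))
            try:
                result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
                out.write(result.stdout)
            except Exception as exc:  # record the failure rather than stopping the capture
                out.write("command failed: %s\n" % exc)

if __name__ == "__main__":
    capture()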
In the event the computer being seized is networked or used for business, it is ad-
visable for seizure activities to be assigned to a computer specialist in order to avoid
disrupting legitimate business operations while preserving important evidence.
Data integrity is the property whereby data “has not been altered in an unauthorized manner since the time it was created, transmitted, or stored by an authorized source [9].”
The integrity of the digital content may be at risk both before collection and after it
has been collected. In the case of preserving precollection integrity, the investigator
is dependent upon effective practices during the preservation phase. In particular, this
entails preventing both physical access as well as remote access to the computer or
other digital device.
Any sectors beyond the end of the data that were assigned to the file when the file was created will be “allocated” (i.e., no other file can occupy these sectors), but will contain only what was there when the file was initially created. This is called File Slack.
A duplicate of a volume entails a bit-by-bit transfer from one device to another.
Consequently, it will contain the same data, RAM slack and file slack as the original.
Likewise, it will also include the unallocated space from the original. This allows a
forensic analysis to be done as though it were being performed on the original.
On the other hand, a copy will contain the data, but the copy may very well contain
RAM slack and file slack from the computer that did the copying rather than the slack
from the original drive. While a copy may be adequate for file backups and ordinary
file transfers, since evidence can reside in files, RAM slack, file slack, file meta-data
and erased files, a duplicate of the original storage device is typically preferred over
a copy.
Since creating a duplicate entails a “bit-by-bit transfer” (actually it is more ac-
curate to say a “sector-by-sector transfer”), tools to create duplicates can ignore the
specifics of different file systems, since partitions, directory structures, etc. are all
copied from the source device to the destination device with no need for interpreta-
tion. On the other hand, creating a copy (sometimes called a “backup copy”) typically
implies interpretation of the original storage device’s file system, since file contents are recognized and copied from source to target.
FIG. 2. The Logicube SF-5000 hardware disk duplicator.
Tools for creating duplicates range from hardware devices designed for use in the field, such as the Logicube SF-5000 shown in Fig. 2, to software tools such as the UNIX dd command used in special controlled environments.
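In the same spirit as dd, the following minimal sketch shows the idea of a sector-by-sector transfer. The device paths and sector size are assumptions, and a real examination would rely on a validated duplication tool rather than an ad hoc script.

# duplicate_sketch.py -- illustrative sector-by-sector duplication, in the spirit of
# the UNIX dd command. Device names and sector size are assumptions for illustration.
SECTOR_SIZE = 512

def duplicate(source="/dev/sdb", target="/dev/sdc"):
    """Copy every sector from source to target without interpreting any file
    system structures, so slack space and unallocated space carry over."""
    with open(source, "rb") as src, open(target, "r+b") as dst:
        while True:
            sector = src.read(SECTOR_SIZE)
            if not sector:
                break
            dst.write(sector)

# A logical copy, by contrast, would walk the file system (e.g., with shutil.copytree)
# and reproduce only the contents of allocated files, losing slack and deleted data.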
Unfortunately, while the concept of a bit-by-bit duplicate seems intuitively obvi-
ous, in practice the definition is not so clear. For instance, virtually all disks have
one or more defective sectors when they are shipped from the factory. However, with
99.999999% of the sectors on the disk still usable, there is no reason to discard the
entire drive. These blocks are avoided during use by listing them in a map of bad
sectors. Since the defective sectors are different on each hard drive, even the best
duplication procedure will fail to ensure that every corresponding bit is the same on
both the source and target drives. For instance, Table I indicates sectors containing
data as Di and defective sectors that have been mapped as “X.”
Nevertheless, these two disks would be considered duplicates because each sector
was copied from the source to the target drive. Both the RAM slack and the file slack
would be retained, albeit they may be located in different physical sectors on the two
disks. Other forensically permissible differences also may occur. For instance, unless
exactly the same model hard drives are used, the target drive may be slightly larger
than the source drive, have a different number of cylinders, etc.
TABLE I
Source: D1 D2 D3 D4 X  D5 D6 D7 X  X  D8 X  D9
Target: D1 X  D2 D3 D4 D5 X  X  D6 D7 D8 D9 X
One approach, commonly used by forensic tools such as New Technologies’ SafeBack [11], Guidance Software’s EnCase [12], and Elliot Spencer’s iLook [13] (which is freely distributed by the Internal Revenue Service to law enforcement agencies), is to store the duplicate of the device as a single image file.
Such an image can accurately reflect the various forms of slack and unallocated
space.
Using a disk image as opposed to an actual physical disk to hold the duplicate requires that specific tools be used to process the image files, and it prevents the duplicated drive from actually being used (it is considered a bad idea to boot off a drive under examination anyway). However, it allows artifacts such as bad block maps to be ignored, and allows things like disk compression of the image, so a sparsely populated 80 GB hard drive might have an image that only consumes 10 GB. Evidence from each of these tools has been admitted in numerous cases. Therefore, the technology behind images versus actual duplicate physical disks has been accepted by the courts.7
A recent project [14] at the National Institute of Standards and Technology (NIST)
led by Jim Lyle undertook an extensive effort to specify the desirable behavior of
forensic disk duplication tools and evaluated the behavior of a number of frequently
used tools against this specification. The specification now provides a formal stan-
dard against which tools can be evaluated.
One of the biggest motivations for insisting on seizing a physical disk rather than
simply creating a duplicate at the scene for analysis is the anticipation of a challenge
in court that the “duplicate” somehow differs from the original. To be safe, many
investigators prefer to seize the physical disk so it can be presented later as proof
that the forensic evidence was not altered.
One approach suggested to address this concern is to create two duplicates of a
hard drive in the presence of the owner or some other disinterested third party. One
of the drives, the control drive, is labeled and sealed. The label is signed by the owner or third party, and the control drive is stored in a secure location while the other is used as the working “evidence disk.” This protects the examiner against a challenge to the authenticity of
the working copy, since the owner’s copy could be unsealed and compared to the one
examined [15].
Regardless of whether an entire computer, the original disk, or a duplicate of the
original disk is obtained from the search and seizure activity, the examination is only
performed on a duplicate of whatever was obtained. The items seized are immedi-
ately placed under physical control in order to establish what is known as the “Chain
of Custody.” The Chain of Custody ensures that the evidence is accounted for at all
times; the passage of evidence from one party to the next is documented, as is the
7 In most cases, the images are only used for examination anyway. The “best evidence” is still the original
disk seized from the suspect’s computer.
passage of evidence from one location to the next. The location and possession of
controlled items should be traceable from the time they are seized until the time they
appear in court as evidence.
Should the case lead to prosecution, another duplicate of the original storage de-
vice is made available to the Defense under the commonly accepted rules of discov-
ery for their use in evaluating the evidence put forth by the government. In such a
situation, the ability to use a cryptographic hash such as MD5 makes it easy to ensure that everyone (both the Prosecution and the Defense) has access to the same information and that no tampering with the evidence has occurred.
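A minimal sketch of such a verification appears below. The image file name is hypothetical; in practice the hash would be computed with a validated forensic tool and recorded in the case documentation.

# verify_image.py -- sketch: compute an MD5 digest of a disk image so that the
# prosecution and defense copies can be shown to be identical. File name is illustrative.
import hashlib

def md5_of_image(path, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    print(md5_of_image("evidence_disk.img"))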
• Memory and flash cards that are used in digital cameras will contain pho-
tographs, but they may also be used to store other digital information as well.
• Printers and FAX machines maintain buffers containing the last items printed
or scanned as well as user logs.
Each of these lends itself to different methods of data capture and different evidentiary opportunities. Almost every device is different. For instance, accessing the buffers in a FAX machine is possible, but requires specific knowledge that is not readily available to the typical forensic analyst.
Developing a generic framework by which various new devices can be examined
forensically can avoid having to develop new expertise on every case that uses a
different device. Some work towards this is being done [21] through attempts to
formalize computer input and output systems using a specification language called
Hadley. If successful, this may provide a generic view of traditional access schemes such as IDE, EIDE, SCSI, etc.
For instance, in the case of the computer and peripherals found in the metham-
phetamine lab example, the investigator will want to locate evidence that may sup-
port the charge of producing false identification. In this case, the investigator will ask
the examiner to locate files containing forged documents, files containing templates
of ID cards and/or files containing photographs of a variety of individuals suitable
for pasting into a scanned identification card, subject to the constraints of the search
warrant.
FIG. 3.
Hardware write blockers are simply devices that connect a hard drive to the Examination Machine’s bus. The device accepts ‘write’ commands but fails to act upon them, while reporting an acknowledgment that it has written the requested data. Without such a “success acknowledgment,” many Windows applications will hang waiting for a signal that the ‘write’ completed successfully.
The examination environment may consist of special-purpose forensic GUI-based
tools such as EnCase or iLook that provide an interface strikingly similar to a
software development IDE (Integrated Development Environment). Consequently,
such forensic environments may be referred to as Integrated Forensic Environments
(IFEs).
Integrated Forensic Environments can manage all electronic evidence for a case
and provide pull-down menus giving access to most of the commonly used forensic
functions. Some of these functions include searching for text strings, searching for
specific classes of text strings (e.g., e-mail addresses), reconstructing deleted files,
matching and excluding “known good” files, etc.
Conversely, the examination environment may simply be a command line inter-
face that allows the examiner to issue commands. Command line environments such
as Brian Carrier’s @stake Sleuth Kit [24] allow extensibility and, as command-line proponents argue, more control over the use of the tools. For example, Fig. 4 shows the use of the @stake environment on a DOS partition. The mmls command displays the layout of a disk, including the unallocated spaces.
The output of such tools often provides more information to the examiner than an integrated environment. However, they also tend to be cumbersome for minimally trained personnel to use. For example, in order to compute an MD5 hash for
a given file, a GUI-based forensic examiner would simply select an option from a
pull-down menu while his command-line counterpart would run md5sum. As can
be expected, both environments have their (vocal) supporters. However, the GUI-based environments are generally more accessible to minimally trained examiners.
FIG. 4. @stake tools.
Because of the size of today’s digital storage devices, the examiner must use some
systematic approach to search for specific textual strings among the thousands of
files. This usually involves the use of either GUI-based or command line-based soft-
ware tools that do efficient string or pattern matching.
TABLE II
File_id 54579
Hashset_id 21
File_name ORDERS.DBF
Directory C:\PROGRA~1\MICROS~2\OFFICE
Hash C5A5113D5493951FE448E8E005A5C136
File_size 989
Date_modified 11/17/96
Time_modified 0:00:00
Time_zone PST
Date_accessed 12/30/99
Time_accessed 0:00:00
Because we know that known good files will not contain evidence, they can be
excluded from string searches. By omitting system and application files from search-
ing, a great deal of time can be saved. For instance, omitting the files installed with
Windows XP from a string search can save up to 1.5 GB of searching. Omitting files
installed with Office XP can save almost 300 MB.
Identifying known good files can be problematic. If files are excluded based on their file names, it would be an easy job for a criminal to simply rename incriminating files to the names of files installed with popular applications, minimizing the likelihood that they will be searched. The Hashkeeper [29] dataset developed by Brian Deering of the National Drug Intelligence Center introduced to the forensics community the paradigm of using cryptographic hashes, such as MD5, to uniquely identify files.
The Hashkeeper dataset currently contains MD5 hashes from hundreds of popular applications that would be expected to be found on most personal users’ computer hard drives. These range from operating system installations such as Microsoft Windows 2000 Server to computer games such as Diablo II and reference software such as Broder-
bund Click Art 10,000. The dataset accounts for roughly three quarters of a million
hashes. A typical entry in the Hashkeeper dataset contains the following comma-
delimited fields, see Table II.
An alternate dataset is the National Institute of Standards and Technology’s National Software Reference Library (NSRL) [30]. The NSRL contains MD5 and SHA-1 cryptographic hashes as well as a 32-bit CRC checksum of files from operating systems, vertical applications, database management systems, graphics packages, games, etc. Version 1.4 of the NSRL contains hashes for over 3300 products and over 10,000,000 separate file entries.
The NSRL provides a more sophisticated dataset organization than the Hashkeeper dataset, with four comma-delimited tables containing information on Oper-
TABLE III
SHA-1 00006DB99FED8A329CC81712584F3949147CCB14
MD5 D48BC5EB79A3FAF08E3A119528F55D72
CRC32 A7A335B4
FileName hands003.wpg
FileSize 3846
ProductCode 2524
OpSystemCode WIN
ating Systems, Manufacturers, Products, and of course, the actual file hashes them-
selves, see Table III.
Prior to beginning any string searches, the appropriate hashes of each file on the Evidence Disk can be computed and compared to the hashes in the known good dataset. Files that match the hashes in the known good dataset can be omitted from subsequent string searches.
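The general idea of known-good filtering can be sketched as follows. The hash-set file format and the paths are simplifying assumptions; an actual examination would use a validated tool together with the full Hashkeeper or NSRL data.

# known_good_filter.py -- sketch: skip files whose MD5 appears in a known-good hash
# set before running string searches. Paths and the hash-set format are assumptions.
import hashlib
import os

def md5_of_file(path):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

def load_known_good(hashset_path):
    """One uppercase MD5 hash per line (an assumed, simplified format)."""
    with open(hashset_path) as f:
        return {line.strip().upper() for line in f if line.strip()}

def files_to_search(evidence_root, known_good):
    for dirpath, _dirs, names in os.walk(evidence_root):
        for name in names:
            path = os.path.join(dirpath, name)
            if md5_of_file(path) not in known_good:
                yield path   # only these files go on to the string search

# Example (hypothetical paths):
#   known = load_known_good("known_good_md5.txt")
#   for path in files_to_search("/mnt/evidence_copy", known):
#       ...run keyword searches against path...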
As long as the hash matching and searches are performed at the File System or Application Layers, the idea of “known good” files can be used. However, at less
abstract layers, a file is nothing more than a noncontiguous sequence of bits and
bytes. Hashes cannot be used under these circumstances to discard files from analysis
since there is no way to tell when a file begins or ends.
“Known bad” hash sets have also been developed for a number of well-known images that have been documented to portray minors in pornographic displays.
The use of “known bad” file hashes is particularly significant because child
pornography represents a very large percentage of all criminal digital forensic
effort—some have estimated as much as 60–70% of the effort expended in exam-
ining digital evidence involves child pornography.
In addition to speeding up the search, the collection of hashed images typically used by law enforcement has also been documented to contain minors through identification and interview of the subjects. This is particularly important in the United
States where the Supreme Court has ruled the possession of “synthetic child pornog-
raphy” (the manipulation of nonsexual images of children into images of them engag-
ing in sex acts—for example, by pasting the face of a child onto the body of an adult
using an image manipulation program such as Photoshop) is not a crime.10 Conse-
quently, documentation of images as actually depicting child porn can be very labor
intensive. Once an image is documented, using the MD5 hash to uniquely identify it
provides an additional degree of efficiency when investigating child pornography.
As is the case in identifying “known good,” tools exist that compute and compare
the MD5 hash for every file on a computer against a list of MD5 hashes of docu-
mented images of known child pornography.
The use of cryptographic hashes to represent specific images also circumvents the
problem of maintaining contraband items. By simply maintaining the database of
hashes (as opposed to the images themselves) agencies can determine if a suspect
possesses child porn on his computer while avoiding the security and control neces-
sary if contraband is maintained on-site.
Unfortunately, even small changes such as cropping can alter the MD5 hash of an
image, so it is quite easy to circumvent “known bad” searches. Hashing “regions” of bit-mapped images such as JPEGs and comparing those may help address this problem, although no current “known bad” hash sets have taken this approach. Further, because most crimes do not lend themselves to a database of “known bad” images, this technique has limited applicability.
Another common practice among criminals in possession of incriminating images (e.g., state identification templates for use in identity theft or forgery) is to change the file name and extension. For instance, the image file “ODL_Template.jpg” can easily be renamed “OT.doc.” This is frequently done in the hopes that an exam-
iner may miss the file when manually examining images.
10 The 1996 Child Pornography Prevention Act originally had made possession of synthetic child pornog-
raphy illegal, but the U.S. Supreme Court struck this aspect of the law down as a violation of First Amend-
ment rights in April 2002.
Forensic tools are often used that compare file names with the contents of the file.
A common image format is JPEG (Joint Photographic Experts Group), usually stored in the JPEG File Interchange Format (JFIF). JFIF image files begin with the following 4-byte header:
• FFD8 (Start Of Image marker);
• FFE0 (JFIF marker).
As files are analyzed by forensic search tools, the first four bytes of each file can be
compared against the file name extension. Files that present discrepancies, such as a word processing document that actually begins with a JPEG header, can be flagged specifically for manual examination. Ironically, the very act of attempting to obfus-
cate an incriminating file actually draws attention to it during a forensic examination.
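A sketch of this signature-versus-extension comparison is shown below. The header bytes follow the JFIF layout described above, and the set of "expected" extensions is an assumption made for illustration.

# signature_check.py -- sketch: flag files whose content begins like a JFIF image
# (FF D8 FF E0) but whose name does not carry a typical image extension.
import os

JPEG_HEADER = bytes([0xFF, 0xD8, 0xFF, 0xE0])   # Start Of Image marker followed by JFIF marker
IMAGE_EXTENSIONS = {".jpg", ".jpeg"}             # extensions treated as "expected"; an assumption

def flag_renamed_jpegs(evidence_root):
    flagged = []
    for dirpath, _dirs, names in os.walk(evidence_root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                header = f.read(4)
            ext = os.path.splitext(name)[1].lower()
            if header == JPEG_HEADER and ext not in IMAGE_EXTENSIONS:
                flagged.append(path)   # e.g., an "OT.doc" that is really an image
    return flagged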
Some forensic applications provide galleries of thumbnails for every image on the
Evidence Disk as can be seen in the accompanying thumbnail screen (see Fig. 5).
The forensic examiner can quickly peruse a screen full of images at a time, looking
FIG. 5.
for possible evidentiary items. However, Constitutional concerns are an issue, since
examining a large set of thumbnails may exceed the scope of a warrant.
Nevertheless, searching images for evidence can be very time consuming. Evi-
dence Disks with tens of thousands of different images are not uncommon.
Fortunately, many graphic images are commercial clipart such as the ones that
come with Microsoft Office. Therefore, they may also be omitted from analysis
through the use of known good filtering. This may reduce the number of image files
the examiner must view by hundreds or even thousands.
When a user deletes a file under Windows, the file is moved to the Recycle Bin. The date and time of the deletion, as well as the original path and naming information, are added to a system file in the Recycle Bin folder called INFO.
Neither the deleted file’s MFT entry nor its content is physically deleted. If new entries are added to the MFT, the deleted MFT entries are overwritten by NTFS before the MFT is extended. However, absent the creation of new files or directories,
a deleted file and all its content will continue to be available via both the MFT and
the Recycle Bin MFT entry. Even if the deleted MFT entry is overwritten, if the data
is nonresident, it will probably still be physically retained on the volume.
While the file is still in the Recycle Bin, it is fully accessible using the Explorer,
and no special tools or techniques are necessary. Most computer users do not habit-
ually empty their Recycle Bin every day, and deleted files tend to build up.
However, when the user does empty the Recycle Bin, the Recycle Bin indexes are
cleared and the MFT entries for both the files stored in the Recycle Bin and the INFO
system file are marked as deleted by setting the file state attribute to “0.” However,
as with a regular file, the MFT entry for the INFO file will not be overwritten until
new files are added.
Even if the MFT entries are entirely obliterated, the contents of nonresident at-
tributes (e.g., the contents of a file) may remain accessible at the media management
level for an extended period of time after the file is deleted.
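One way to recover such content at the media-management level is file "carving": scanning the raw image for known file signatures without consulting the MFT at all. The sketch below, which looks for the JPEG header discussed earlier, is a simplification; the image path is hypothetical, and real carving tools also handle fragmentation and ambiguous end-of-file markers.

# carve_jpegs.py -- sketch: scan a raw image for JPEG headers in order to recover
# picture content even after its MFT entry is gone. Simplified: real carvers stream
# the image, cope with fragmented files, and validate the recovered data.
JPEG_HEADER = b"\xff\xd8\xff\xe0"
MAX_CARVE = 5 * 1024 * 1024          # assumed upper bound on a carved file's size

def carve(image_path="evidence_disk.img"):
    with open(image_path, "rb") as f:
        data = f.read()               # acceptable for a sketch; not for an 80 GB image
    hits = []
    offset = data.find(JPEG_HEADER)
    while offset != -1:
        carved = data[offset:offset + MAX_CARVE]
        end = carved.find(b"\xff\xd9")        # End Of Image marker, if present
        hits.append((offset, carved[:end + 2] if end != -1 else carved))
        offset = data.find(JPEG_HEADER, offset + 1)
    return hits   # list of (byte offset in image, candidate JPEG bytes)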
Additionally, such a system may have hundreds of users, all of whom have a rea-
sonable expectation of the privacy of the information they store in “their” directories.
Typically a search warrant would not provide blanket permission to simply snoop
through all these unrelated files, and in fact, special rules exist with respect to “pri-
vate electronic communication” (i.e., e-mail) on “remote computing services” (i.e.,
service providers such as ISPs).
Significant issues exist regarding how one can acquire admissible evidence from
a live, multi-user system without unacceptably degrading performance, while at the
same time avoiding access to “unnecessary information.” Obviously a live system is
continually modifying logs; processes start and stop; files are created, accessed, and deleted. While the technical aspects of capturing such information are relatively easy, it may be difficult to demonstrate authenticity when the case comes to trial.
3.7.2 Authenticity
Before specific digital information can be admitted as evidence it must be shown
to be authentic, or in other words, it must be shown to actually be what it is
claimed to be. This is usually done through the testimony of someone who has first-hand knowledge of the digital information. For example, a police officer can tes-
tify that a hard drive is the same one that was seized from a computer in the
defendant’s residence, or a bank officer can testify to the authenticity of bank
records.
Challenges to the authenticity (and therefore admissibility) of digital information
often take on one of three forms:
1. Digital information can be easily altered, and it can be suggested by the De-
fense that digital information may have been changed or altered (either mali-
ciously or inadvertently) after the information was seized. However, without
specific evidence that tampering occurred such as MD5 hashes that do not
match, the mere possibility of tampering has been ruled to not affect the au-
thenticity of digital information.
2. If digital information such as logs, meta data, a file’s last modified date, etc. has
been created by a computer program the authenticity of the information hinges
on the reliability of the computer programs that created the data. If the pro-
gram is shown to have programming errors which could lead to its output being
inaccurate, the information may not be what it is claimed to be. Because programming errors are so prevalent in commercial software, this could raise
serious problems when introducing any computer-generated evidence. How-
ever, the courts have indicated that this challenge can be overcome as long as
the information can be considered trustworthy. For instance, the trustworthi-
ness of a computer program can be established by showing that users of the
program rely on its output on a regular basis. Once a level of trustworthi-
ness has been established, challenges to the accuracy of the digital informa-
tion generated by the computer program affect only the weight (i.e., degree
to which the jury considers the evidence) of the evidence, not its admissibil-
ity.
3. If the digital information consists of the electronically recorded writings of a
person such as e-mail, instant messages, word processing documents, or chat
room messages, the statements contained in the digital information must be
shown to be truthful and accurate. The authenticity of such evidence is usu-
ally challenged by questioning the author’s identity—in other words, how do
we know the person actually is the one that produced the document? Evidence
such as ISP logs or the contents of a user’s “Sent” mail folder may end up being
used to authenticate such evidence.
Some operations the forensic analyst might perform are fairly straightforward, even
to a layperson with minimal computer knowledge. On the other hand, many activities
the investigator may perform are not so straightforward.
For instance, retrieving deleted files requires a certain level of technical expertise
in order to understand how reliable and valid the technique is. One might question the likelihood that a “retrieved” file containing the string “I did it” is actually the reconstructed last eight bytes of the string “I agreed to return the money, I did, and I am glad I did it.” Or whether there were not actually two files, one that contained “I did” and one that contained “it.” An examiner must be ready to describe
and defend the actions taken to “reconstruct” a deleted file since a juror is more likely
to have a “reasonable doubt” if they do not understand the procedures undertaken to
collect the evidence.
Currently, the Daubert ruling is used in Federal Courts and in the courts of a number of states. In 1999, Kumho Tire v. Carmichael extended Daubert to nonscientific expert testimony. Other non-Daubert states use a variety of tests in determining whether expert opinion should be admitted into evidence or not. In general, the Daubert standard is both the most restrictive and the most generally applicable.
Consequently, new techniques should be evaluated under the six rules listed above.
The only computer found was in the living room. When located, the computer was already turned off.
Since the forensic technician was not able to accompany them on the seizure, the
detective labeled each cable with a piece of tape, and carefully disconnected each
from the computer. The computer was placed in the trunk of a patrol car and the
suspect was given a receipt for all items taken.
The computer was taken to the Police Computer Forensic Lab, where it was tagged
and all pertinent serial numbers written down. Afterwards, the forensics technician
removed the seized machine’s 40 GB hard drive and signed it out in order to maintain
the Chain of Custody. The technician selected a disk of comparable size to serve as
the examination disk and verified that the drive had been “scrubbed” to remove any
data that may be on the disk.
The technician first computed the MD5 hash for the evidence disk.
The technician placed both the evidence (source) disk and the examination (target)
disk in a hardware device called a Logicube SF-5000 [32], a special purpose IDE disk
duplication tool. The original hard disk was returned to the Evidence Locker where
it was stored with the rest of the computer.
The detective asked the technician to locate evidence that would tie the suspect to
the order that was placed at http://www.pdxGolfing.com. Upon some investigation,
the technician found that the http://www.pdxGolfing.com retail site uses cookies to
persistently maintain the shopping cart while the customer is shopping.
A cookie is an entry on the user’s hard drive where Internet applications can store
information during a session. Usually the cookie entry includes the name of the do-
main the user was using when the cookie was recorded, as well as selected pieces of
information the Internet application may wish to keep available. A cookie may last only while the session is active, or it may be saved for a longer period of time so the user can come back later and pick up where they left off.
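As an illustration only, the technician's check could be approximated by the sketch below. The directory layout follows the per-user Internet Explorer cookie store described in this example, and the paths and domain are stand-ins rather than an actual procedure.

# cookie_scan.py -- sketch: look through a copied Cookies directory for files that
# mention a domain of interest. Paths and the domain are illustrative assumptions.
import os

def find_cookie_files(cookies_dir, domain="pdxgolfing.com"):
    matches = []
    for name in os.listdir(cookies_dir):
        path = os.path.join(cookies_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", errors="replace") as f:
            content = f.read()
        # The cookie file name itself (e.g., user@www.pdxgolfing.com[2].txt) often
        # identifies the domain, but the file content is checked as well.
        if domain in name.lower() or domain in content.lower():
            matches.append(path)
    return matches

# Example (hypothetical mount point of the examination copy):
#   find_cookie_files(r"examination_copy/Documents and Settings/user/Cookies")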
The examination disk was placed in an external drive bay connected to an exami-
nation machine and a quick inventory of the files was made. Noting that the suspect
used the Microsoft Internet Explorer Web browser, the forensic technician examined
the \Documents and Settings\user\Cookies directory. If the user had used Internet
Explorer to place the order, this directory would contain a cookie providing evidence
of this fact.
FIG. 6.
As soon as the technician opened the Cookies folder, he found a file called
user@www.pdxgolfing.com[2].txt, see Fig. 6.
This file contains the cookies deposited by applications located at the pdxgolf-
ing.com domain. Upon opening the file, the technician saw the following two lines:
ordernumber  ODL79365  www.pdxgolfing.com/  1536242056576029578815146
items  XLNT_Golf_Clubs-1045  www.pdxgolfing.com/  1536242056576029578
These entries immediately show not only that the suspect’s computer had visited this
site before, but that the user had placed order number ODL79365 for a set of XLNT
golf clubs.
After the forensic analysis, the detective has sufficient evidence to link the sus-
pect with the fraudulent order placed on the Internet. This escalates the crimes the
suspect can be charged with from Criminal Mischief III, a Class C Misdemeanor, to
at least three Class C Felonies:12 Computer Crime, Fraudulent Use of a Credit Card
12 These crimes are based on Oregon Revised Statutes: ORS 164.377, ORS 165.055 and ORS 165.800,
the specific crimes the suspect may be charged with will vary from state to state.
(a Class C Felony since the amount is over $750), as well as Identity Theft. This has
dramatically raised the stakes for the suspect from a fine of under $1000 and/or 30
days in the county jail to three counts, each of which could result in a fine of up to
$100,000 and/or 5 years in prison.
FIG. 7.
used to show that he was the only one with access to the computer at this particular
point in time. If this is indeed the case, it would cast serious doubt upon his claim
that he had nothing to do with the on-line purchase.
Several organizational models exist for a digital forensics capability within Law
Enforcement. There is not currently any one model that has become standardized.
The organizational model is important because it heavily influences the level of ex-
pertise, training and resources that may be brought to bear in creating and using new
technology.
One popular approach is a distributed structure. In this organizational structure,
individual agencies each have their own physical facilities (often just a small room,
a forensics computer and an external drive) and one or two investigators that have
been trained at some level in the technology of digital forensics. In the distributed or-
ganizational model, agencies work on their own cases, sometimes sharing resources
with other nearby agencies as time and circumstances permit.
Another common approach is a cooperative model in which both resources and
personnel are pooled by several agencies to create a regional digital forensics labora-
tory. Usually the laboratory services a relatively limited geographical area—perhaps
adjoining counties for a metropolitan area. Cases are also pooled with no considera-
tion of jurisdiction. For example, if Smallville Police contributes a trained examiner
to the regional laboratory, and later they request services, the case may go to any
of the examiners, and not necessarily the one contributed by Smallville. This is the
model favored by the Federal Bureau of Investigation’s Regional Computer Forensic
Laboratories (RCFLs). The RCFL model has proven quite effective at mobilizing
resources within a geographic area. Currently, eight FBI RCFLs are either in place
or under construction: San Diego, California; Dallas, Texas; Kansas City, Missouri;
Chicago, Illinois; Buffalo, New York; Newark, New Jersey; Portland, Oregon; and
Salt Lake City, Utah.
A very prevalent model is a “service-based” model. In this model, local agencies
do not have any digital forensic capabilities themselves. Rather, they may seize an
entire computer and send it to a centralized facility, often the state crime lab, or the
Federal Bureau of Investigation [34] for analysis.
Each of these different organizational structures has both strong and weak points.
For example, in the distributed structure, individual forensic investigators tend to be
relatively isolated, and lack colleagues to “bounce ideas off.” On the other hand, the
agency has full control over its cases, can track their progress, etc. The cooperative
model may result in an agency’s cases being given lower priority. On the other hand,
investigators from different agencies can work in a collegial environment and create
a “critical mass” resulting in innovation and growth.
Because of its applied nature, useful research in the field will most certainly manifest
itself as tools that can be ultimately used in forensic investigations. However, some
of the most promising advancements in this area will likely entail adaptation of work in other fields such as databases, algorithms, graphics, and software engineering.
8. Conclusions
This chapter has provided an introduction to the field of digital forensics and a
brief overview of some of the technology being used. The field has matured greatly
since the 1980s when investigators used Microsoft’s DEBUG to search for deleted
files on MS-DOS machines, and dealt with 5 MB hard drives.
The insular past of the community has restricted participation to practitioners until
very recently. With the increased participation of computer science researchers, new
techniques and capabilities can be expected.
R EFERENCES
[1] Flynn M.K., “Computer crime scenes”, PC Magazine (19 February 2002).
[2] Yasinac A., Erbacher R., Marks D., Pollitt M., Sommer P., “Computer forensics educa-
tion”, IEEE Security and Privacy (July/August 2003) 15–23.
[3] Hosmer C., “Proving the integrity of digital evidence with time”, International Journal
of Digital Evidence 1 (1) (2002).
[4] U.S. Federal Rules of Evidence, U.S. Government Printing Office, 1 December 2001.
[5] “Searching and seizing computers and obtaining electronic evidence in criminal in-
vestigations, computer crime and intellectual property section, Criminal Division,
United States Department of Justice”. Available at: http://www.usdoj.gov/criminal/
cybercrime/s&smanual2002.htm, July 2002.
[6] Reith M., Carr C., Gunsch G., “An examination of digital forensic models”, International
Journal of Digital Evidence 1 (3) (2002).
[7] C. Wade (Ed.), FBI Crime Scene Search, 1999. Available at: http://www.fbi.gov/hq/
lab/handbook/scene1.htm.
[8] “U.S. Secret Service, Best practices for seizing electronic evidence”. Available at: http://
www.secretservice.gov/electronic_evidence.shtml, 2002.
[9] Vanstone S., van Oorschot P., Menezes A., Handbook of Applied Cryptography, CRC
Press, 1997.
[10] Scientific Working Group on Digital Evidence and International Organization on Digital
Evidence, “Digital Evidence: standards and principles”, Forensic Science Communica-
tions 2 (2) (2000).
[11] http://www.forensics-intl.com/safeback.html.
[12] http://www.guidancesoftware.com/.
[13] http://www.ilook-forensics.org/.
[14] Lyle J., “NIST CFTT: testing disk imaging”, International Journal of Digital Evi-
dence 1 (4) (2003).
[15] Bates J., “Fundamentals of computer forensics”, International Journal of Foren-
sic Computing (January/February 1997). Available at: http://www.forensic-computing.
com/archives/fundamentals.
[16] Rivest R., The MD5 Message-Digest Algorithm, RFC-1321, MIT LCS and RSA Data
Security, Inc., April 1992.
[17] FIPS Publication 180-1, Secure Hash Standard, April 1995.
[18] Kornblum J., “Preservation of fragile Digital Evidence by first responders”, in: 2nd An-
nual Digital Forensics Research Workshop, August 2002.
[19] Willassen S.Y., “Forensics and the GSM mobile telephone system”, International Jour-
nal of Digital Evidence 2 (4) (2003).
[20] Grand J., “pdd: memory imaging and forensic analysis of Palm OS devices”, in: Pro-
ceedings of the 14th Annual Computer Security Incident Handling Conference, June
2002.
[21] Gerber M.B., Leeson J.J., “Shrinking the Ocean: formalizing I/O methods in modern
operating systems”, International Journal of Digital Evidence 1 (2) (2002).
[22] International Organization on Computer Evidence, “Good practices for seizing elec-
tronic devices”, in: Notes from the International Organization on Computer Evidence
2000 Conference, Rosny sous Bois, France, December 2000.
[23] “Hard Drive Software Write Block Tool Specification and Test Plan”, Draft Version 3.0,
National Institute of Standards and Technology, May 2003.
[24] http://www.sleuthkit.org/sleuthkit/.
[25] Philips L., “The double metaphone search algorithm”, C/C++ Users Journal (June
2000).
[26] http://www.vogon-computer-evidence.us/gentree_software.htm.
[27] NTRCFL website: http://www.ntrcfl.org/.
[28] Carrier B., “Defining digital forensic examination and analysis tools using abstraction
layers”, International Journal of Digital Evidence 1 (4) (2003).
[29] http://www.hashkeep.org.
[30] National Institute of Standards, “National Software Reference Library (NSRL) Project
Web Site”. Available at: http://www.nsrl.nist.gov/index.html.
[31] Kerr O.S., Computer Records and the Federal Rules of Evidence, Computer Crime and
Intellectual Property Section, Criminal Division, United States Department of Justice,
March 2001.
[32] http://www.logicube.com/products/hd_duplication/sf5000.asp.
[33] Stambaugh H., Beaupre D.S., Icove D.J., Baker R., Cassaday W., Williams W.P., “As-
sessment for State and Local Law Enforcement”, U.S. Department of Justice Report,
NCJ 186276, March 2001.
[34] “FBI Handbook of Forensic Services, Computer Examinations”. Available at: http://
www.fbi.gov/hq/lab/handbook/examscmp.htm.
Survivability: Synergizing Security
and Reliability
CRISPIN COWAN
Immunix, Inc.
920 SW 3rd Avenue
Portland, OR 97204
USA
crispin@immunix.com
Abstract
In computer science, reliability is the study of how to build systems that con-
tinue to provide service despite some degree of random failure of the system.
Security is the study of how to build systems that provide privacy, integrity, and
continuation of service. Survivability is a relatively new area of study that seeks
to combine the benefits of security and reliability techniques to enhance system
survivability in the presence of arbitrary failures, including security failures. De-
spite apparent similarities, the combination of techniques is not trivial. Nonetheless, some success has been achieved, and it is surveyed here.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2. The Problem: Combining Reliability and Security . . . . . . . . . . . . . . . . . . . 122
3. Survivability Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.1. Design Time: Fault Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.2. Implementation Time: Writing Correct Code . . . . . . . . . . . . . . . . . . . 129
3.3. Run Time: Intrusion Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.4. Recovery Time: Intrusion Tolerance . . . . . . . . . . . . . . . . . . . . . . . . 133
4. Evaluating Survivability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.1. Formal Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2. Empirical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
1. Introduction
At first glance, reliability and security would seem to be closely related. Security
is defined [71] as privacy, integrity, and continuation of service, the latter seeming to
encompass a degree of reliability. Conversely, reliability is defined as systems that
mask faults to prevent failures of systems to provide their specified services [70].
The combination reliability and security is a natural fit: both seek to improve sys-
tem availability, both must deal with failures and their consequences. Techniques to
ensure software quality such as source code auditing, type safe languages, fault iso-
lation, and fault injection all work well for both security and reliability purposes;
improving one tends to improve the other.
However, interpreting security “faults” (vulnerabilities) as failures in the reliability
sense has proven to be problematic. Failure is defined as “deviation from specifica-
tion,” which is not helpful if the specification itself is wrong. Moreover, many real
systems are implemented in type-unsafe languages (especially C) and so correspon-
dence to a formal specification cannot easily be assured. Thus reliable software does
what it is supposed to do, while secure software does what it is supposed to do, and
nothing else [2]. The surprising “something else” behaviors form the crux of the
software vulnerability problem.
So security and reliability cannot be trivially composed to achieve the benefits of
both. Survivability is the study of how to combine security and reliability techniques
to actually achieve the combined benefits of both. The intersection of the two is to
be able to survive failures of the security system. The union of the two is to survive arbitrary failures, including failures of the security system, not just random failures.
The rest of this chapter is organized as follows. Section 2 describes the problem of
composing security and reliability techniques in greater detail. Section 3 surveys and
classifies survivability techniques. Section 4 surveys methods of assessing surviv-
ability. Section 5 describes related work surveying survivability. Section 6 presents
conclusions.
A crucial assumption to fault tolerance is that faults are independent: that there
is no causal relationship between a fault in one component and a fault in another
component. This assumption breaks down with respect to security faults (vulnera-
bilities) because replication of components replicates the defects. Attackers seeking
to exploit these vulnerabilities can readily compromise all replicas, inducing failure.
Thus survivable systems must provide something more than replication to be able
to survive the design and implementation faults that are at the core of the security
problem.
Reliability also assumes that faults are random, while security cannot make such
an assumption. For instance, a system that depends on random memory accesses not
hitting the address 0x12345678 can be highly reliable (assuming no data structures
are adjacent) but is not secure, because the attacker can aim at an address that is
otherwise improbable. In the general case, the attacker can maliciously induce faults
that would otherwise be improbable. Thus the traditional reliability techniques of
redundancy and improbability do not work against security threats.
The traditional security approach to masking security faults is prevention: either
implement with such a high degree of rigor that vulnerabilities (exploitable bugs) do
not occur, or else design in such a way that any implementation faults that do occur
cannot manifest into failures. Saltzer and Schroeder canonicalized these prevention
techniques in 1975 [71] into the following security principles:
1. Economy of mechanism: designs and implementations should be as small and
simple as possible, to minimize opportunities for security faults, i.e., avoid
bloat.
2. Fail-safe defaults: access decisions should default to “deny” unless explicitly
specified, to prevent faults due to unanticipated cases.
3. Complete mediation: design such that all possible means of access to an object
are mediated by security mechanisms.
4. Open design: the design should not be secret, and in particular, the design
should not depend on secrecy for its security, i.e., no “security through ob-
scurity.”
5. Separation of privilege: if human security decisions require more than one hu-
man to make them, then faults due to malfeasance are less likely.
6. Least privilege: each operation should be performed with the least amount of
privilege necessary to do that operation, minimizing potential failures due to
faults in that privileged process, i.e., don’t do everything as root or admin-
istrator.
7. Least common mechanism: minimize the amount of mechanism common
across components.
8. Psychological acceptability: security mechanisms must be comprehensible and
acceptable to users, or they will be ignored and bypassed.
These principles have held up well over time, but some more than others. Least
privilege is a spectacular success, while least common mechanism has failed to com-
pete with an alternate approach of enhanced rigor applied to common components
that are then liberally shared.
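To make the least-privilege principle concrete, the following sketch (not from the original text; it assumes a POSIX system and a hypothetical unprivileged account name) shows the common pattern of keeping root only long enough to acquire privileged resources and then irrevocably dropping to an unprivileged user:

    import os
    import pwd

    def drop_privileges(username: str = "daemon") -> None:
        """Least privilege: keep root only long enough to acquire privileged
        resources (e.g., bind a low port), then switch to an unprivileged user."""
        if os.getuid() != 0:
            return                    # already unprivileged; nothing to drop
        pw = pwd.getpwnam(username)   # 'daemon' is a placeholder account name
        os.setgroups([])              # clear supplementary groups first
        os.setgid(pw.pw_gid)          # drop the group while still root
        os.setuid(pw.pw_uid)          # drop the user last; irreversible afterwards

Every operation performed after the drop runs with the smaller privilege set, so a later fault in that code cannot escalate back to root.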
Unfortunately, these techniques also turn out to be too expensive. They are hard
to apply correctly, succeeding only rarely. When they do succeed in building highly
secure (invulnerable) systems, the result is so restricted and slow that it tends to fail
in the marketplace, having been eclipsed by less secure but more featureful systems.
So in practice, Saltzer and Schroeder’s techniques fail most of all for lack of being
applied in the first place. Security faults are thus inevitable [26]. Survivability is then
the study of how to mask security faults, and do so such that attackers cannot bypass
the fault masking. Section 3 examines how security faults can be effectively masked.
3. Survivability Techniques
In Section 2 we saw that redundancy and improbability are insufficient to mask
security faults against an intelligent adversary, because the adversary can deliberately
invoke common mode failures. How then to mask unknown security faults?
Colonel John R. Boyd, USAF, defined a strategy called OODA: Observe, Orient,
Decide, and Act [42]. These four steps describe specifically how a fighter pilot should
respond to a threat, and the approach generalizes to other fields of conflict, including
computer security, which plays out as follows:
1. Observe data that might indicate a threat.
2. Orient by synthesizing the data into a plausible threat.
3. Decide how to react to that threat.
4. Act on the decision and fend off the threat.
This strategy is often used in computer survivability research to build adaptive
intrusion response systems. These systems detect intrusions (using IDS/Intrusion
Detection Systems) and take some form of dynamic action to mitigate the intrusion.
They precisely follow Boyd’s OODA loop.
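As a rough illustration of how such an adaptive response system follows the OODA loop, the sketch below (a simplification; the sensor and blocking callables are hypothetical interfaces, not any particular IDS product) polls a sensor, orients by aggregating events into suspected sources, decides on blocking actions subject to an allow-list, and acts by pushing blocks to an enforcement point:

    import time
    from dataclasses import dataclass

    @dataclass
    class Event:
        source: str    # e.g., the offending IP address
        severity: int  # sensor-assigned severity, 0-10

    def orient(events, threshold=7):
        """Orient: synthesize raw observations into a set of suspected attackers."""
        return {e.source for e in events if e.severity >= threshold}

    def decide(suspects, allowlist):
        """Decide: choose a mitigation, never blocking allow-listed sources."""
        return [("block", s) for s in suspects if s not in allowlist]

    def ooda_loop(poll_sensor, block, allowlist, interval=1.0):
        """poll_sensor() observes; block() acts on the chosen mitigation."""
        while True:
            events = poll_sensor()                                # Observe
            for _, source in decide(orient(events), allowlist):   # Orient, Decide
                block(source)                                     # Act
            time.sleep(interval)  # a tighter interval means a tighter OODA loop

The interval parameter is where the tension discussed next plays out: shorter loops respond faster, while longer loops accumulate more context before acting.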
Survivability techniques vary in the time frame in which the intrusion detection
and response occur. Boyd advocated tight OODA loops to “get inside” the adver-
sary’s control loop, acting before the adversary can respond. In computer systems,
tight response loops have the distinct advantage of preventing the intrusion from
proceeding very far, and thus prevent most of the consequent damage.
However, larger OODA loops are not without merit. Taking a broader view of
intrusion events enables more synthesis of what the attacker is trying to do, producing
better “orientation” (in Boyd’s terms) and thus presumably better “decisions.”
specifications can be created that are both succinct (easy to create and to verify) and
precise (closely approximate least privilege).
Access control schemes originally used an abstraction of controlling interactions
among users on a time-share system. But by the mid-1990s, most computers had
become single-user: either single-user workstations, or no-user network servers that
do not let users log in at all and instead just offer services such as file service (NAS),
web service, DNS, etc., and thus user-based access control schemes became cum-
bersome. To that end, survivability research has produced several new access control
mechanisms with new abstractions to fit new usage patterns:
• Type enforcement and DTE: Type enforcement introduced the idea of abstract-
ing users into domains, abstracting files into types, and managing access control
in terms of which domains can access which types [11]. DTE (Domain and Type
Enforcement [3,4]) refined this concept.
• Generic Wrappers: This extension of DTE [38] allows small access control
policies, written in a dialect of C++, to be dynamically inserted into a running
kernel. A variation of this concept [5] makes this facility available for Microsoft
Windows systems, but in doing so implements the access controls in the DLLs
(Dynamically Linked Libraries) instead of in the kernel, compromising the non-
bypassability of the mechanism.
• SubDomain: SubDomain is access control streamlined for server appliances
[23]. It ensures that a server appliance does what it is supposed to and nothing
else by enforcing rules that specify which files each program may read from,
write to, and execute. In contrast to systems such as DTE and SELinux, SubDo-
main trades expressiveness for simplicity. SELinux can express more sophis-
ticated policies than SubDomain, and should be used to solve complex multi-
user access control problems. On the other hand, SubDomain is easy to manage
and readily applicable. For instance, Immunix Inc. entered an Immunix server
(including SubDomain) in the Defcon Capture-the-Flag contest [20] in which
TABLE I
Lampson's Access Control Matrix
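Since the body of Table I is not reproduced here, the following minimal sketch (with hypothetical subjects, objects, and rights) illustrates the access-control-matrix idea underlying these mechanisms: subjects index the rows, objects the columns, and each cell lists the permitted rights, with absent cells denied by default, much as a SubDomain-style profile whitelists the files a program may read, write, or execute:

    # Hypothetical subjects and objects; each cell is the set of permitted rights.
    ACCESS_MATRIX = {
        ("httpd",  "/var/www/index.html"): {"read"},
        ("httpd",  "/var/log/httpd"):      {"read", "write"},
        ("backup", "/var/www/index.html"): {"read"},
    }

    def check(subject: str, obj: str, right: str) -> bool:
        """Complete mediation with a fail-safe default: a missing cell means deny."""
        return right in ACCESS_MATRIX.get((subject, obj), set())

    assert check("httpd", "/var/log/httpd", "write")
    assert not check("httpd", "/etc/shadow", "read")            # default deny
    assert not check("backup", "/var/www/index.html", "write")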
3.1.2 Firewalls
Initially designed to protect LANs from the outside Internet [19], firewalls pro-
gressed to being deployed within LANs to compartmentalize them, much the way
mandatory access control systems compartmentalized time-share users [8]. In recent
self-identified survivability research, the Secure Computing ADF card [67,65] iso-
lates security breaches inside insecure PCs by providing an enforcing firewall NIC
managed from somewhere other than the PC itself. Thus compromised machines can
be contained from a management console.
In early 2000, a new class of attacks appeared: DDoS (Distributed Denial-of-
Service) attacks, in which an attacker co-opts a large number of weakly defended
computers around the Internet, and then commands them all to flood a victim’s ma-
chine with traffic. These attacks are very difficult to defend against, especially for
public web sites intended to allow anyone on the Internet to submit requests. De-
fenses against such attacks are either to trace back the origin of the attacks [72]
or to attempt to filter the DDoS traffic from legitimate traffic somewhere in the
network to block the flood [34,66,84]. Work in this area has led to commercial
ventures such as Arbor Networks (www.arbornetworks.com) and Mazu Networks
(www.mazunetworks.com).
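As a rough sketch of the filtering approach (not the algorithm of any cited system; the rate parameters are illustrative), a per-source token bucket admits traffic only while each source stays under an assumed legitimate rate:

    import time
    from collections import defaultdict

    class SourceRateLimiter:
        """Admit packets from a source only while it stays under `rate` packets/s,
        allowing bursts of up to `burst` packets."""
        def __init__(self, rate: float = 100.0, burst: float = 200.0):
            self.rate, self.burst = rate, burst
            self.tokens = defaultdict(lambda: burst)     # per-source token buckets
            self.last = defaultdict(time.monotonic)      # per-source last-seen time

        def admit(self, source: str) -> bool:
            now = time.monotonic()
            elapsed = now - self.last[source]
            self.last[source] = now
            self.tokens[source] = min(self.burst,
                                      self.tokens[source] + elapsed * self.rate)
            if self.tokens[source] >= 1.0:
                self.tokens[source] -= 1.0
                return True
            return False                                 # drop: source over its budget

A flood that keeps each co-opted machine under the per-source budget slips through unchanged, which is exactly the arms-race limitation described next.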
The problem with DDoS defenses is that they are subject to an arms race. The
detection technologies rely on synthetic artifacts of the bogus data that the DDoS agents generate. As the DDoS agents become more sophisticated, the
data will come to more closely resemble legitimate traffic. In principle, there is no
reason why DDoS traffic cannot be made to look exactly like a heavy load of real
traffic; this is a fundamental difference between DoS attacks and misuse attacks: DoS traffic need not deviate at all from real traffic. When faced with DDoS traffic
that is identical to real traffic, filtering will become either ineffective or arbitrary.
Traceback has a better chance of success in the face of sophisticated attack, but will
be labor-intensive until the Internet itself changes to support effective traceback of
traffic.
There are several dimensions in which intrusion prevention techniques can be clas-
sified. We present a 3-dimensional view, intended to classify together technologies
that achieve similar goals:
• Network vs. Host: Intrusion prevention can be done either at the network layer
or within the host. Network intrusion detection is much easier to deploy, but
because there is little context in network traffic, it is possible for attackers to
evade network intrusion detection methods [69,76].
• Detection vs. Prevention: Some tools only detect intrusions, while others re-
spond to intrusion events and shut the attackers down (closing the OODA loop).
Prevention is effectively detection + response.
• Misuse vs. Anomaly: Some systems characterize and arrest known system misuse and allow everything else (misuse detection), while others characterize normal system behavior and treat anomalous behavior as an intrusion (anomaly detection). Misuse detection is fast and accurate, but fails to detect novel attacks. Anomaly detection can detect novel attacks, but is subject to a high false positive rate (complaints about traffic that is actually legitimate) [57].
Populating this array of properties, we get:
• Network
– Detection
∗ Misuse: This area is dominated by commercial NIDS (Network Intrusion Detection System) products such as ISS RealSecure and the open source SNORT.
∗ Anomaly: The ability to detect novel attacks has generated keen interest in this area of research [55,86], but little of it has had real-world impact due to high false positive rates. Industrial applications of intrusion detection demand very low false-positive rates, because even a modest false-positive rate combined with high bandwidth leads to high staffing requirements; for example, a 0.1% false-positive rate applied to a million events per day still yields a thousand alarms for analysts to triage.
– Prevention
∗ Misuse: Network misuse prevention emerged as a commercial market in
2002, taking the more reliable parts of network intrusion detection sys-
tems and placing them in-line with the network connection, acting much
like an application-level firewall with misuse prevention rules. Example
systems include the Hogwash and Inline-SNORT NIPS (Network Intru-
sion Prevention Systems).
∗ Anomaly: Network anomaly prevention is a new way of looking at clas-
sic firewalls, which permit traffic with specified source and destination IP
addresses, ports, and protocols, and deny all other traffic.
• Host
– Detection
∗ Misuse: This area is referred to as HIDS (Host Intrusion Detection Systems), typified by research projects such as EMERALD [68] and STAT [88]. These sys-
tems actually use a combination of misuse and anomaly detection. Com-
mercial HIDS similarly use a combination of anomaly and misuse detec-
tion, and also provide for both detection and prevention, as exemplified by
products such as Zone Alarm and Norton Personal Firewall.
∗ Anomaly: There are a variety of ways to do host anomaly detection, de-
pending on which factors are measured, and how “normal” and “anom-
alous” are characterized. Forrest et al. [35] monitor sequences of system
calls, ignore the arguments to system calls, and look for characteristic
n-grams (sequences of length n) to distinguish between “self” (normal)
and “non-self” (anomalous). Eskin et al. [32] generalize this technique to
look at dynamic window sizes, varying n. Ghosh et al. [40] use machine
learning to characterize anomalous program behavior, looking at BSM log
records instead of system call patterns. Michael [62] presents an algorithm
to find the vocabulary of program behavior data for anomaly detection, substantially reducing the volume of raw, redundant anomaly data to be considered. Tripwire [51,45] does not look at program behavior at all, and instead detects changes in files that are not expected to change, based on a checksum; files are profiled in terms of their probability of change, so that, e.g., changes in /var/spool/mail/jsmith are ignored (e-mail has arrived) while changes in /bin/login are considered very significant. (Minimal sketches of the n-gram and checksum approaches appear after this classification.)
– Prevention
∗ Misuse: The most familiar form of host misuse prevention is antivirus soft-
ware that scans newly arrived data for specific signatures of known mali-
cious software. However, this very narrow form of misuse prevention is also very limited, in that it must be constantly updated with new virus signatures, a
limitation made obvious every time a virus becomes widespread before the
corresponding signature does, causing a viral bloom of Internet mail such
as Melissa [14], “I Love You” [15], or Sobig.F [17]. Host misuse preven-
tion can be provided either by the environment or compiled into software components.
+ Kernel: In the kernel environment, an exemplary system is the Openwall
Linux kernel patch [29] which provides both a non-executable stack seg-
ment to resist buffer overflow attacks, and also prevents two pathological
misuses of hard links and symbolic links. PaX [82] generalizes Open-
wall’s non-executable stack segment to provide non-executable heap
Domain [23] and load them into the standard kernels they get with commercially
distributed Linux systems.
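The following sketch illustrates the Forrest-style host anomaly detection described in the classification above (it is an illustration, not code from any of the cited systems): "self" is profiled as the set of system-call n-grams seen during training, and a new trace is scored by the fraction of unseen windows.

    from collections import deque

    def ngrams(trace, n=6):
        """Yield sliding windows of length n over a system-call trace."""
        window = deque(maxlen=n)
        for call in trace:
            window.append(call)
            if len(window) == n:
                yield tuple(window)

    def build_profile(training_traces, n=6):
        """'Self': the set of n-grams observed during normal operation."""
        profile = set()
        for trace in training_traces:
            profile.update(ngrams(trace, n))
        return profile

    def anomaly_score(trace, profile, n=6):
        """Fraction of windows not seen in training; higher means more anomalous."""
        windows = list(ngrams(trace, n))
        if not windows:
            return 0.0
        return sum(1 for w in windows if w not in profile) / len(windows)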
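In the same spirit, a Tripwire-style integrity check can be sketched as a baseline of file checksums compared on demand; this omits Tripwire's per-file change-probability profiles and is only an illustration:

    import hashlib
    import os

    def snapshot(paths):
        """Record a baseline checksum for files that are not expected to change."""
        return {p: hashlib.sha256(open(p, "rb").read()).hexdigest()
                for p in paths if os.path.isfile(p)}

    def report_changes(baseline):
        """Flag any monitored file whose checksum differs from the baseline."""
        changed = []
        for path, digest in baseline.items():
            try:
                current = hashlib.sha256(open(path, "rb").read()).hexdigest()
            except OSError:
                current = None        # deleted or unreadable counts as a change
            if current != digest:
                changed.append(path)
        return changed

The baseline itself must be stored where an intruder cannot rewrite it, or the check can be trivially bypassed.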
Finally, we return to Boyd’s OODA Loop. In the above technologies, those marked
as “prevention” provide their own built-in intrusion mitigation mechanisms (usually
fail-stop) and thus provide a very tight OODA loop. Those marked as “detection”
need to be composed with some other form of intrusion mitigation to actually be
able to enhance survivability.
This longer OODA loop sacrifices the responsiveness that Boyd so highly prized,
in favor of more sophisticated analysis of intrusion events, so as to gain greater pre-
cision in discerning actual intrusions from false-alarms due to subtle or ambiguous
intrusion event data. For instance, IDIP [74] provides infrastructure and protocols
for intrusion sensors (network and host IDS) to communicate with analyzers and
mitigators (firewalls embedded throughout the network) to isolate intrusions.
The CIDF (Common Intrusion Detection Framework) project was a consortium-
effort of DARPA-funded intrusion detection teams to build a common network lan-
guage for announcing and processing intrusion events. CIDF tried to provide for
generality using Lisp-based S-expressions to express intrusion events. Unfortunately,
CIDF was not adopted by intrusion detection vendors outside the DARPA research
community. Subsequent attempts to build IETF [9] standards for conveying intrusion
events have yet to achieve significant impact.
• The site must employ staff with expertise in each of the heterogeneous systems
present. Many systems administrators know only a few systems, and those that
know many systems tend to be senior and cost more.
• The site must patch each of these systems, and to the extent that vulnerabilities
are not common, the patching effort is multiplied by the number of heteroge-
neous systems present. One study [44] found that a site with an infrastructure
of nine NT servers and eight firewalls, for example, would have needed 1315
updates during the first nine months of 2001.
• Multiple versions of applications must be purchased or developed, incurring
additional capital and support expenses.
So while heterogeneity can be effective at providing survivability, it comes at a
substantial cost. It is not yet clear whether the costs of heterogeneity outweigh the
benefits. However, this problem is not specific to the heterogeneity defense; actual
survivability is difficult to assess regardless of the methods employed. Section 4
looks at survivability evaluation methods.
4. Evaluating Survivability
The security assurance problem is “How can I tell if this system is secure?” To
solve that, one must answer “Will this program do something bad when presented
with ‘interesting’ input?” Unfortunately, to solve that, one must solve Turing’s Halt-
ing Problem [85], and Turing’s theorem proves that you cannot, in general, write a
program that will examine arbitrary other programs and their input and determine
whether or not they will halt.
Thus in the fully general case, the security assurance problem cannot be statically
decided automatically, and so other means must be employed to determine the se-
curity assurance of a system. Because determining the actual security of systems is
so problematic, security standards such as the TCSEC (“Orange Book”) and Com-
mon Criteria turned instead to documentation of how hard the developers tried to
provide security, by verifying the inclusion of security enhancing features (access
controls, audit logs, etc.) and the application of good software engineering practice
(source code control, design documentation, etc.), and, at higher levels of certification, a degree of testing.
The question of “How survivable is this system?” is even more problematic, be-
cause the survivability question entails assuming bugs in the software, further weak-
ening assumptions on which to base assurance arguments. Section 4.1 looks at stud-
ies to evaluate survivability through formal methods, and Section 4.2 looks at empir-
ical evaluations of survivability.
Just as Turing shows that you cannot write a static analyzer to determine whether a program will behave badly when given arbitrary input, you cannot exhaustively test whether a program will behave badly when given arbitrarily bad input.
So rather than attempt to exhaustively test software for vulnerability, security test-
ing takes the form of measuring the attacker's work factor: how much effort the attacker must apply to break the system. Red team experimentation (also known as ethical hacking) is where a defender deliberately hires an adversarial team
to attempt to break into the defender’s systems. Red teams and actual attackers use
largely the same methods of attack. The critical differences are:
• That red teams can be given boundaries. Defenders can ask red teams to at-
tack only selected subsets of the actual system (e.g., don’t attack the production
payroll system on payday) and expect these boundaries to be respected.
• That red teams will explain what they have done. Actual attackers may leave
an amusing calling card (web site defacement) but they often do not explain the
exact details of how they succeeded in compromising security, making it rather
expensive to first discover and then repair the holes. Professional red teams, in
contrast, will report in detail what they did, allowing defenders to learn from
the process.
DARPA funded red team experimentation on the effectiveness of the DARPA-
funded survivability technologies. For example, Levin [56] describes several red
team experiments testing the validity of a scientific hypothesis. Ideally the experi-
ment should be repeatable, but that is problematic: Levin found that it is important
to set the goals of the experiment and the rules of engagement clearly, because un-
documented claims and attacks cannot be validated.
The results are very sensitive to what the red team knows about the defender’s
system at the time of the experiment: if the defender’s system is obscured, then the
red team will spend most of their time discovering what the system is doing rather
than actually attacking it. While this is arguably similar to the situation faced by
actual attackers, it is an expensive use of the red team’s time, because it can be fairly
safely assumed that actual attackers will be able to learn whatever they want about
the defender’s system. The experimental results also depend heavily on the skills of
the red team, which are hard to reproduce exactly: a different team will assuredly
have a different set of skills, producing different rates of success against various
defenses.
An alternate approach to red team experimentation is symmetric hacker gaming, in
which the individual designated attacker and defender teams of a classical red team
engagement are replaced by a handful of teams that are competitively set to attack
each other while simultaneously defending themselves [20]. Each team is required to
maintain an operational set of network services and applications, and a central score-
keeping server records each team’s success at keeping services functional, as well as
which team actually “owns” a given server or service. This symmetric threat model
tends to reduce disputes over the rules of engagement, because all teams are equally
subject to those rules. This results in a nearly no-holds-barred test of survivability.
We entered an Immunix server in the Defcon “Root Fu” (nee “Capture the Flag”)
games in 2002 and 2003, with mixed results. In both games, we placed 2nd of
8 teams. In the 2002 game, the Immunix machine was never compromised, but it
did take most of the first day to configure the Immunix secure OS such that it could
earn points, because the required functionality was obscured, being specified only as
a reference server image that did provide the required services, but was also highly
vulnerable. However, the Immunix server was also DoS'd to the point where it no longer scored points; in retrospect this is not surprising, as Immunix was designed to prevent intrusion, not DoS. The 2003 game explicitly prohibited DoS attacks, but teams deployed DoS attacks anyway. So even in symmetric red teaming, rules of engagement matter only to the extent that they are uniformly enforced.
Another form of empirical testing is to measure the precision of intrusion detec-
tion using a mix of test data known to be either “good” or “bad.” DARPA sponsored
such a study in the 1998 MIT/Lincoln Labs intrusion detection test [57,61]. The
goal of DARPA’s intrusion detection research was to be able to detect 90% of at-
tacks (including especially novel attacks) while reducing false positive reports by
an order of magnitude over present intrusion detection methods. Intrusion detection
systems deployed by the US Air Force in the mid-1990s were expensive to oper-
ate because network analysts had to spend many hours investigating false positive
“intrusion events” that were actually benign.
The results were mixed. False positive rates were significantly lowered versus pre-
vious generations of technologies, but were still high enough that intrusion detection
still requires significant human intervention. Worse, the detectors failed to detect a
significant number of novel attacks. Only a few of the tested technologies were able
to detect a few of the novel attacks.
Yet another aspect of empirical measurement is to examine the behavior of attack-
ers. This behavior matters because survivability is essentially fault tolerance against
the faults that attackers will induce, and so expectations of survivability need to be
measured against this threat. Browne, Arbaugh, McHugh and Fithen [13] present a
trend analysis of exploitation, studying the rates at which systems are compromised,
with respect to the date on which the vulnerabilities in question were made public.
This study showed that exploitation spikes not immediately following disclosure of
the vulnerability, but rather after the vulnerability is scripted (an automatic exploit is
written and released).
Our own subsequent study [7] statistically examined the relative risks of patching
early (risk of self-corruption due to defective patches) versus the risk of patching
later (risk of security attack due to an unpatched vulnerability) and found that ap-
proximately 10 days after a patch is released is the optimal time to patch. However,
the gap between the time a vulnerability is disclosed and when it is scripted appears
to be closing rapidly [91] and so this number is expected to change.
5. Related Work
The field of Information Survivability dates back to the early 1990s, when DARPA
decided to take a fresh approach to the security problem. In a field so young, there
are not many survey papers. In 1997, Ellison et al. [31] surveyed this emerging disci-
pline, which they characterize as the ability of a system to carry out its mission while
connected to an unbounded network. An unbounded network is one with no cen-
tralized administration, and thus attackers are free to connect and present arbitrary
input to the system, exemplified by the Internet. They distinguish survivability from
security in that survivability entails a capacity to recover. Unfortunately, they were
anticipating the emergence of recovery capability, and that capability has yet to ef-
fectively emerge from survivability research. Self-recovery capacity remains an area
of strong interest [59]. They distinguish survivability from fault tolerance in that fault
tolerant systems make failure statistically improbable in the face of random failures,
but cannot defend against coincident failures contrived by attackers, as described in
Section 2.
Stavridou et al. [77] present an architectural view of how to apply the techniques
of fault tolerance to provide intrusion tolerance. They propose that individual com-
ponents should be sufficiently simple that their security properties can be formally
assured, and that the entire system should be multilevel secure (so security faults are
isolated) as in Section 3.1.
Powell et al. [1] describe the MAFTIA (Malicious- and Accidental-Fault Toler-
ance for Internet Applications) conceptual model and architecture. This is a large
document with 16 authors, describing a long-term project. They define dependabil-
ity as the ability to deliver service that can justifiably be trusted, survivability as the
capability of a system to fulfill its mission in a timely manner, and trustworthiness
as assurance that a system will perform as expected. They conclude that all three of
these concepts are essentially equivalent. Dependability has been studied for the last
thirty years by organizations such as the IFIP working group 10.4, and from this per-
spective, survivability can be viewed as a relatively recent instance of dependability
studies.
In 2000 we surveyed post hoc security enhancement techniques [25] which subse-
quently came to be known as intrusion prevention. This survey considered adapta-
tions (enhancements) in two dimensions:
• What is adapted:
– Interface: the enhancement changes the interface exposed to other compo-
nents.
– Implementation: the enhancement is purely internal, nominally not affecting
how the component interacts with other components.
• How the enhancement is achieved:
– Restriction: the enhancement restricts behavior, either through misuse detec-
tion or anomaly detection (see Section 3.3).
– Randomization: the enhancement uses natural or synthetic diversity to ran-
domize the system so as to make attacks non-portable with respect to the
defender’s system (see Section 3.4).
This two-dimensional space thus forms quadrants. We found that effective tech-
niques exist in all four quadrants, but that in most cases, restriction is more cost-
effective than randomization. Interestingly, we found that it is often the case that
when one goes looking for a randomization technique, one finds a restriction tech-
nique sitting conceptually beside the randomization technique that works better: if
a system attribute can be identified as something the attacker depends on, then it is
better to restrict the attacker’s access to that resource than to randomize the resource.
We also conducted a similar study with narrower focus, examining buffer overflow
attacks and defenses [28]. Responsible for over half of all security faults for the
last seven years, buffer overflows require special attention. A buffer overflow attack
must first arrange for malicious code to be present in the victim process’s address
space, and then must induce the victim program to transfer program control to the
malicious code. Our survey categorized attacks in terms of how these objectives can
be achieved, defenses in terms of how they prevent these effects, and summarized
with effective combinations of defenses to maximize coverage.
6. Conclusions
References
[1] Adelsbach A., Cachin C., Creese S., Deswarte Y., Kursawe K., Laprie J.-C., Powell D.,
Randell B., Riordan J., Ryan P., Simmonds W., Stroud R.J., Veríssimo P., Waidner M.,
Wespi A., Conceptual Model and Architecture of MAFTIA, LAAS-CNRS, Toulouse, and
University of Newcastle upon Tyne, January 31, 2003. Report MAFTIA deliverable D21,
http://www.laas.research.ec.org/maftia/deliverables/D21.pdf.
[2] Arce I., “Woah, please back up for one second”, http://online.securityfocus.com/archive/
98/142495, October 31, 2000. Definition of security and reliability.
[3] Badger L., Sterne D.F., et al., “Practical domain and type enforcement for UNIX”, in:
Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, May 1995.
[4] Badger L., Sterne D.F., Sherman D.L., Walker K.M., Haghighat S.A., “A domain and
type enforcement UNIX prototype”, in: Proceedings of the USENIX Security Confer-
ence, 1995.
[5] Balzer R., “Assuring the safety of opening email attachments”, in: DARPA Information
Survivability Conference and Expo (DISCEX II), Anaheim, CA, June 12–14, 2001.
[6] Baratloo A., Singh N., Tsai T., “Transparent run-time defense against stack smashing
attacks”, in: 2000 USENIX Annual Technical Conference, San Diego, CA, June 18–23,
2000.
[7] Beattie S.M., Cowan C., Arnold S., Wagle P., Wright C., Shostack A., “Timing the appli-
cation of security patches for optimal uptime”, in: USENIX 16th Systems Administration
Conference (LISA), Philadelphia, PA, November 2002.
[8] Bellovin S.M., “Distributed firewalls”, ;login: 24 (November 1999).
[9] Bester J., Walther A., Erlinger M., Buchheim T., Feinstein B., Mathews G., Pollock R.,
Levitt K., “GlobalGuard: Creating the IETF-IDWG Intrusion Alert Protocol (IAP)”, in:
DARPA Information Survivability Conference Expo (DISCEX II), Anaheim, CA, June
12–14, 2001.
[10] Bhatkar S., DuVarney D.C., Sekar R., “Address obfuscation: an approach to combat
buffer overflows, format-string attacks, and more”, in: 12th USENIX Security Sympo-
sium, Washington, DC, August 2003.
[11] Boebert W.E., Kain R.Y., “A practical alternative to hierarchical integrity policies”, in:
Proceedings of the 8th National Computer Security Conference, Gaithersburg, MD,
1985.
[12] Bray B., Report, How Visual C++ .Net Can Prevent Buffer Overruns, Microsoft, 2001.
[13] Browne H.K., Arbaugh W.A., McHugh J., Fithen W.L., “A trend analysis of exploita-
tions”, in: Proceedings of the 2001 IEEE Security and Privacy Conference, Oakland,
CA, May 2001, pp. 214–229, http://www.cs.umd.edu/~waa/pubs/CS-TR-4200.pdf.
[14] CERT Coordination Center, “CERT Advisory CA-1999-04 Melissa Macro Virus”, http://
www.cert.org/advisories/CA-1999-04.html, March 27, 1999.
[15] CERT Coordination Center, “CERT Advisory CA-2000-04 Love Letter Worm”, http://
www.cert.org/advisories/CA-2000-04.html, May 4, 2000.
[16] CERT Coordination Center, “CERT Advisory CA-2002-07 Double Free Bug in
zlib Compression Library”, http://www.cert.org/advisories/CA-2002-07.html, March 12,
2002.
[17] CERT Coordination Center, “CERT Incident Note IN-2003-03”, http://www.cert.org/
incident_notes/IN-2003-03.html, August 22, 2003.
[18] Chen H., Wagner D., “MOPS: an infrastructure for examining security properties of
software”, in: Proceedings of the ACM Conference on Computer and Communications
Security, Washington, DC, November 2002.
[19] Cheswick W.R., Bellovin S.M., Firewalls and Internet Security: Repelling the Wily
Hacker, Addison-Wesley, 1994.
[20] Cowan C., Arnold S., Beattie S.M., Wright C., “Defcon capture the flag: Defending
vulnerable code from intense attack”, in: DARPA Information Survivability Conference
and Expo (DISCEX III), Washington, DC, April 22–24, 2003.
[21] Cowan C., Barringer M., Beattie S., Kroah-Hartman G., Frantzen M., Lokier J., “For-
matGuard: automatic protection from printf format string vulnerabilities”, in: USENIX
Security Symposium, Washington, DC, August 2001.
[22] Cowan C., Beattie S., Johansen J., Wagle P., “PointGuard: protecting pointers from
buffer overflow vulnerabilities”, in: USENIX Security Symposium, Washington, DC, Au-
gust 2003.
[23] Cowan C., Beattie S., Pu C., Wagle P., Gligor V., “SubDomain: parsimonious server
security”, in: USENIX 14th Systems Administration Conference (LISA), New Orleans,
LA, December 2000.
[24] Cowan C., Beattie S., Wright C., Kroah-Hartman G., “RaceGuard: kernel protection
from temporary file race vulnerabilities”, in: USENIX Security Symposium, Washington,
DC, August 2001.
[25] Cowan C., Hinton H., Pu C., Walpole J., “The cracker patch choice: an analysis of post
hoc security techniques”, in: Proceedings of the 19th National Information Systems Se-
curity Conference (NISSC 2000), Baltimore, MD, October 2000.
[26] Cowan C., Pu C., Hinton H., “Death, taxes, and imperfect software: surviving the in-
evitable”, in: Proceedings of the New Security Paradigms Workshop, Charlottesville,
VA, September 1998.
[27] Cowan C., Pu C., Maier D., Hinton H., Bakke P., Beattie S., Grier A., Wagle P., Zhang Q.,
“StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks”,
in: 7th USENIX Security Conference, San Antonio, TX, January 1998, pp. 63–77.
[28] Cowan C., Wagle P., Pu C., Beattie S., Walpole J., “Buffer overflows: attacks and de-
fenses for the vulnerability of the decade”, in: DARPA Information Survivability Con-
ference and Expo (DISCEX), January 2000. Also presented as an invited talk at SANS
2000, March 23–26, 2000, Orlando, FL, http://schafercorp-ballston.com/discex.
[29] “Solar Designer”, “Non-Executable User Stack”, http://www.openwall.com/linux/.
[30] Dietrich S., Ryan P.Y.A., “The survivability of survivability”, in: Proceedings of the
Information Survivability Workshop (ISW 2002), Vancouver, BC, March 2002.
[31] Ellison R.J., Fisher D.A., Linger R.C., Lipson H.F., Longstaff T., Mead N.R., Surviv-
able Network Systems: An Emerging Discipline, Report CMU/SEI-97-TR-013, Software
Engineering Institute, November 1997, http://www.cert.org/research/tr13/97tr013title.
html.
[32] Eskin E., Lee W., Stolfo S.J., “Modeling system calls for intrusion detecting with dy-
namic window sizes”, in: DARPA Information Survivability Conference and Expo (DIS-
CEX II), Anaheim, CA, June 12–14, 2001.
[33] Etoh H., “GCC extension for protecting applications from stack-smashing attacks”,
http://www.trl.ibm.com/projects/security/ssp/, November 21, 2000.
[34] Feinstein L., Schnackenberg D., Balupari R., Kindred D., “Statistical approaches to
DDoS attack detection and response”, in: DARPA Information Survivability Conference
and Expo (DISCEX III), Washington, DC, April 22–24, 2003.
[35] Forrest S., Hofmeyr S.A., Somayaji A., Longstaff T.A., “A sense of self for UNIX
processes”, in: Proceedings of the IEEE Symposium on Security Privacy, Oakland, CA,
1996.
[36] Forrest S., Somayaji A., Ackley D.H., “Building diverse computer systems”, in: HotOS-
VI, May 1997.
[37] Fraser T., Loscocco P., Smalley S., et al., “Security enhanced Linux”, http://www.nsa.
gov/selinux/, January 2, 2001.
[38] Fraser T., Badger L., Feldman M., “Hardening COTS software with generic software
wrappers”, in: Proceedings of the IEEE Symposium on Security and Privacy, Oakland,
CA, May 1999.
[39] Gao Z., Hui Ong C., Kiong Tan W., “Survivability assessment: modelling dependen-
cies in information systems”, in: Proceedings of the Information Survivability Workshop
(ISW 2002), Vancouver, BC, March 2002.
[40] Ghosh A.K., Schwartzbard A., Schatz M., “Learning program behavior profiles for intru-
sion detection”, in: Proceedings of the First USENIX Workshop on Intrusion Detection
and Network Monitoring, Santa Clara, CA, April 1999.
[41] Gosling J., McGilton H., “The Java language environment: A White Paper”, http://www.
javasoft.com/docs/white/langenv/, May 1996.
[42] Hammonds K.H., “The strategy of the Fighter Pilot”, Fast Company 59 (June 2002).
[43] Hollebeek T., Berrier D., “Interception, wrapping and analysis framework for Win32
Scripts”, in: DARPA Information Survivability Conference and Expo (DISCEX II), Ana-
heim, CA, June 12–14, 2001.
[44] Hurley E., “Keeping up with patch work near impossible”, SearchSecurity, http://
searchsecurity.techtarget.com/originalContent/0,289142,sid14_gci796744,00.html, Jan-
uary 17, 2002.
[45] Tripwire Security Inc., “Tripwire.org: Tripwire for Linux”, http://www.tripwire.org/.
[46] Jim T., Morrisett G., Grossman D., Hicks M., Cheney J., Wang Y., “Cyclone: A safe
dialect of C”, in: Proceedings of USENIX Annual Technical Conference, Monterey, CA,
June 2002.
[47] Just J.E., Reynolds J.C., “HACQIT: Hierarchical adaptive control of QoS for intrusion
tolerance”, in: Annual Computer Security Applications Conference (ACSAC), New Or-
leans, LA, December 10–14, 2001.
[48] Kc G.S., Keromytis A.D., Prevelakis V., “Countering CodeInjection attacks with Instruc-
tionSet randomization”, in: Proceedings of the 10th ACM Conference on Computer and
Communications Security (CCS 2003), Washington, DC, October 2003.
[49] Kernighan B.W., Ritchie D.M., The C Programming Language, second ed., Prentice-
Hall, Englewood Cliffs, NJ, 1988.
[50] Kewley D., Fink R., Lowry J., Dean M., “Dynamic approaches to thwart adversary in-
telligence gathering”, in: DARPA Information Survivability Conference and Expo (DIS-
CEX II), Anaheim, CA, June 12–14, 2001.
[51] Kim G.H., Spafford E.H., “Writing, supporting, and evaluating Tripwire: A publicly
available security tool”, in: Proceedings of the USENIX UNIX Applications Development
Symposium, Toronto, Canada, 1994, pp. 88–107.
[52] Knight J.C., Leveson N.G., “An experimental evaluation of the assumptions of indepen-
dence in multiversion programming”, IEEE Transactions on Software Engineering 12 (1)
(1986) 96–109.
[53] Knight J.C., Strunk E.A., Sullivan K.J., “Towards a rigorous definition of information
system survivability”, in: DARPA Information Survivability Conference and Expo (DIS-
CEX III), Washington, DC, April 22–24, 2003.
[54] Lampson B.W., “Protection”, in: Proceedings of the 5th Princeton Conference on Infor-
mation Sciences and Systems, Princeton, NJ, 1971. Reprinted in ACM Operating Sys-
tems Review 8 (1) (January 1974) 18–24.
[55] Lee W., Stolfo S.J., Chan P.K., Eskin E., Fan W., Miller M., Hershkop S., Zhang J.,
“Real time data mining-based intrusion detection”, in: DARPA Information Survivability
Conference and Expo (DISCEX II), Anaheim, CA, June 12–14, 2001.
[56] Levin D., “Lessons learned in using live red teams in IA experiments”, in: DARPA Infor-
mation Survivability Conference and Expo (DISCEX III), Washington, DC, April 22–24,
2003.
[57] Lippmann R., Haines J.W., Fried D.J., Korba J., Das K., “The 1999 DARPA off-line
intrusion detection evaluation”, in: Recent Advances in Intrusion Detection (RAID),
Toulouse, France, October 2–4, 2000.
[58] Liu P., “Engineering a distributed intrusion tolerant database system using COTS compo-
nents”, in: DARPA Information Survivability Conference and Expo (DISCEX III), Wash-
ington, DC, April 22–24, 2003.
[59] Liu P., Pal P., Workshop on Survivable and Self-Regenerative Systems, October 31, 2003.
In conjunction with the ACM International Conference on Computer and Communica-
tions Security (CCS-10).
[60] Loscocco P., Smalley S., “Integrating flexible support for security policies into the Linux
operating system”, in: Proceedings of the FREENIX Track: 2001 USENIX Annual Tech-
nical Conference (FREENIX ’01), June 2001.
[61] McHugh J., “The 1998 Lincoln Lab IDS evaluation—a critique”, in: Recent Advances in
Intrusion Detection (RAID), Toulouse, France, October 2–4, 2000.
[62] Michael C.C., “Finding the vocabulary of program behavior data for anomaly detection”,
in: DARPA Information Survivability Conference and Expo (DISCEX III), Washington,
DC, April 22–24, 2003.
[63] Milner R., Tofte M., Harper R., The Definition of Standard ML, The MIT Press, 1990.
[64] Necula G.C., McPeak S., Weimer W., “CCured: type-safe retrofitting of legacy code”,
in: Proceedings of the 29th ACM Symposium on Principles of Programming Languages
(POPL02), London, UK, January 2002. Also available at http://raw.cs.berkeley.edu/
Papers/ccured_popl02.pdf.
[65] O’Brien D., “Intrusion tolerance via network layer controls”, in: DARPA Information
Survivability Conference and Expo (DISCEX III), Washington, DC, April 22–24, 2003.
[66] Papadopoulos C., Lindell R., Mehringer J., Hussain A., Govindan R., “COSSACK: Co-
ordinated Suppression of Simultaneous Attacks”, in: DARPA Information Survivability
Conference and Expo (DISCEX III), Washington, DC, April 22–24, 2003.
[67] Payne C., Markham T., “Architecture and applications for a distributed embedded fire-
wall”, in: Annual Computer Security Applications Conference (ACSAC), New Orleans,
LA, December 10–14, 2001.
[68] Porras P., Neumann P., “EMERALD: Event Monitoring Enabling Responses to Anom-
alous Live Disturbances”, in: Proceedings of the 20th National Information Systems Se-
curity Conference (NISSC 1997), Baltimore, MD, October 1997.
[69] Ptacek T.H., Newsham T.N., Insertion, Evasion, and Denial of Service: Eluding Net-
work Intrusion Detection, Report, Network Associates Inc., January 1998, http://www.
nai.com/products/security/advisory/papers/ids-html/doc001.asp.
[70] Rushby J., “Critical system properties: Survey and taxonomy”, Reliability Engineering
and System Safety 43 (2) (1994) 189–219.
[71] Saltzer J.H., Schroeder M.D., “The protection of information in computer systems”, Pro-
ceedings of the IEEE 63 (9) (November 1975).
[72] Savage S., Wetherall D., Karlin A., Anderson T., “Network support for IP traceback”,
IEEE/ACM Transactions on Networking 9 (3) (June 2001) 226–237.
[73] Schmid M., Hill F., Ghosh A.K., Bloch J.T., “Preventing the execution of unauthorized
Win32 applications”, in: DARPA Information Survivability Conference and Expo (DIS-
CEX II), Anaheim, CA, June 12–14, 2001.
[74] Schnackenberg D., Djahandari K., Sterne D., “Infrastructure for intrusion detection and
response”, in: DARPA Information Survivability Conference and Expo (DISCEX), Janu-
ary 2000.
[75] Secure Software, “RATS: Rough Auditing Tool for Security”, http://www.
securesoftware.com/download_rats.htm, July 2002.
[76] Song D., “Fragroute”, http://monkey.org/~dugsong/fragroute/, May 27, 2002.
[77] Stavridou V., Dutertre B., Riemenschneider R.A., Saldi H., “Intrusion tolerant software
architectures”, in: DARPA Information Survivability Conference and Expo (DISCEX II),
Anaheim, CA, June 12–14, 2001.
[78] Strom R.E., Bacon D.F., Goldberg A., Lowry A., Yellin D., Yemini S.A., Hermes: A Lan-
guage for Distributed Computing, Prentice-Hall, 1991.
[79] Strom R.E., Yemini S.A., “Typestate: A programming language concept for enhancing
software reliability”, IEEE Transactions on Software Engineering 12 (1) (January 1986)
157–171.
[80] Stroustrup B., The C++ Programming Language, Addison-Wesley, Reading, MA, 1987.
[81] Tan K.M.C., Maxion R.A., “‘Why 6?’ Defining the operational limits of stide, an
anomaly-based intrusion detector”, in: Proceedings of the IEEE Symposium on Security
and Privacy, Oakland, CA, May 2002.
[82] The PaX Team, “PaX”, http://pageexec.virtualave.net/, May 2003.
[83] “tf8”, “Wu-Ftpd remote format string stack overwrite vulnerability”, http://www.
securityfocus.com/bid/1387, June 22, 2000.
[84] Thomas R., Mark B., Johnson T., Croall J., “NetBouncer: client-legitimacy-based high-
performance DDoS filtering”, in: DARPA Information Survivability Conference and
Expo (DISCEX III), Washington, DC, April 22–24, 2003.
[85] Turing A., “On computable numbers with an application to the Entscheidungsproblem”,
Proc. London Math. Society 42 (2) (1937) 230–265.
[86] Valdes A., “Detecting novel scans through pattern anomaly detection”, in: DARPA Infor-
mation Survivability Conference and Expo (DISCEX III), Washington, DC, April 22–24,
2003.
[87] Viega J., Bloch J.T., Kohno T., McGraw G., “ITS4: A static vulnerability scanner for C
and C++ code”, in: Annual Computer Security Applications Conference (ACSAC), New
Orleans, LA, December 2000, http://www.cigital.com/its4/.
[88] Vigna G., Eckmann S.T., Kemmerer R.A., “The STAT tool suite”, in: DARPA Informa-
tion Survivability Conference and Expo (DISCEX), January 2000.
[89] Wagner D., Foster J.S., Brewer E.A., Aiken A., “A first step towards automated detection
of buffer overrun vulnerabilities”, in: NDSS (Network and Distributed System Security),
San Diego, CA, February 2000.
[90] Wagner D., Soto P., “Mimicry attacks on HostBased intrusion detection systems”, in:
Proceedings of the 9th ACM Conference on Computer and Communications Security
(CCS 2002), Washington, DC, October 2002.
[91] Walsh L.M., “Window of opportunity closing for patching”, Security Wire Digest 5 (66)
(September 4, 2003), http://infosecuritymag.techtarget.com/ss/0,295812,sid6_iss82,00.
html#news2.
[92] Wheeler D., “Flawfinder”, http://www.dwheeler.com/flawfinder/, July 2, 2002.
[93] Wright C., Cowan C., Smalley S., Morris J., Kroah-Hartman G., “Linux security module
framework”, in: Ottawa Linux Symposium, Ottawa, Canada, June 2002.
[94] Wright C., Cowan C., Smalley S., Morris J., Kroah-Hartman G., “Linux security mod-
ules: general security support for the Linux kernel”, in: USENIX Security Symposium,
San Francisco, CA, August 2002, http://lsm.immunix.org.
[95] Xu J., Kalbarczyk Z., Iyer R.K., “Transparent runtime randomization for security”, in:
Proceedings of the 22nd Symposium on Reliable Distributed Systems (SRDS’2003), Flo-
rence, Italy, October 2003.
[96] Zhang Y., Dao S.K., Vin H., Alvisi L., Lee W., “Heterogeneous networking: a
new survivability paradigm”, in: Proceedings of the New Security Paradigms Workshop,
Cloudcroft, NM, September 2001.
Smart Cards
KATHERINE M. SHELFER
Drexel University, Philadelphia, PA, USA
kathy.shelfer@xis.drexel.edu
CHRIS CORUM
Avisian, Corp., Tallahassee, FL, USA
chris@avisian.com
J. DREW PROCACCINO
Rider University, Lawrenceville, NJ, USA
jdproc@aol.com
JOSEPH DIDIER
Infinacard, Inc., St. Petersburg, FL, USA
jdider@infinacard.com
Abstract
This paper presents an overview of the history, commercialization, technology,
standards, and current and future applications of smart cards. Section 1 is an
overview of smart cards, including their current global use in identification, ver-
ification and authorization applications through their ability to support transac-
tion processing, information management and multiple applications on a single
card. This section also includes a summary of the invention and early develop-
ment and application of smart cards. The second section describes a typical smart
card-based transaction, tracing it from the initial contact between a card and the
card reader through the transaction to termination of the transaction. The third
section describes the physical characteristics of the smart card, and its associated
contact and contactless interfaces, integrated circuit (IC) chip and processor ca-
pacity. Section 4 summarizes the international standards associated with smart
cards, including those related to interoperability among contact and contactless
cards, and their respective reading devices. In Section 5, the focus is a high-level
discussion of associated access technologies, including a more detailed look at
magnetic stripe and barcode technologies and standards. This section includes a
very brief mention of the impact of RISC-based technologies and Sun’s Java™
Virtual Machine® . Section 6 discusses smart card security relating to the card’s
ability to authorize and facilitate electronic, logical and physical access to con-
trolled applications and physical locations. Also discussed is physical security,
which relates to cardholders, the environment and card tampering, and data security, which relates to the smart card's ability to support cryptography and cross-validation of data stored on the card against multiple databases for purposes of identity verification. Section 7 concludes this paper with a look at the
future of smart card-related developments, including those related to both tech-
nology and applications. Technology-related developments include the support
of more than a single operating system on the processor chip and peripheral card
technologies. Application-related developments include those related to identifi-
cation, information storage and transaction processing.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
1.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
1.2. The Invention of Smart Cards . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2. A Typical Smart Card Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3. Smart Card Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.1. Technology Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.2. Physical Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.3. Contact and Contactless Smart Cards . . . . . . . . . . . . . . . . . . . . . . . 160
3.4. Physical Characteristics of the Integrated Circuit (IC) Chip . . . . . . . . . . . 160
3.5. Processor Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.6. Current Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4. Smart Card Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.1. Smart Card Standards Organizations . . . . . . . . . . . . . . . . . . . . . . . . 164
4.2. Early Smart Card Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.3. Contact Smart Card Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.4. Contactless Smart Cards Standards . . . . . . . . . . . . . . . . . . . . . . . . 167
4.5. Nonstandard Contactless Technologies . . . . . . . . . . . . . . . . . . . . . . 170
4.6. Comparison of ISO/IEC 14443 and ISO/IEC 15693 . . . . . . . . . . . . . . . 170
4.7. The Role of Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5. Associated Access Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.1. Electro-Technology Access: The Magnetic Stripe . . . . . . . . . . . . . . . . 174
5.2. RISC-Based Smart Cards and The Java Virtual Machine . . . . . . . . . . . . . 178
5.3. Multiple Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6. Smart Card Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1. Physical Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.2. Data Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
1. Introduction
1.1 Overview
At this time, smart card applications are used to (1) encourage and protect lawful
economic activity; (2) ensure the survival of critical infrastructures and (3) protect
individuals and societies from those who would deliberately do them harm. There
are many contributing factors that determine the development and deployment of
smart card systems. Among these are the way in which smart cards were invented
and commercialized; current directions in applied research and development; the de-
velopment and support of international standards; and the impact of human concerns
about data security and data privacy. Any item with an embedded microprocessor
chip can be considered “smart,” from keychain fobs to cooking utensils [9]. The most
familiar form, however, is a thin plastic card (refer to Section 3, Smart card technol-
ogy) that contains one or more integrated circuit (IC) chips. The microprocessor, or
“smart” chip, interacts either physically with (contact) or in proximity to (contact-
less) a smart card reader, or both. Information on the card interacts with information
in the reader to authorize the requested session (transaction), for which the chip has
been programmed [21,74].
The smart card is designed to store and serve a range of personal data related to the
cardholder that is used to authorize specific transactions. The value of the smart card
is that it can be used to secure digital transactions that rely on personal identification,
identity verification, and transaction authorization [40,82].
• Identification. Smart cards store information about the cardholder’s identity in
digital form—bank account numbers, organizational affiliations, personal bio-
metrics, etc.—that is used to secure digital transactions, i.e., ATM transactions
and eCommerce [57,58].
• Verification. The smart card stores a range of personal identity data, e.g., biometrics, that provides a means of comparing digital identities with physical identities in settings where neither form of identification alone would be sufficient.
• Authorization. Smart cards are used to electronically authorize the cardholder’s
right to initiate and engage in specific transactions that involve logical, physical
and electronic access controls.
Smart cards allow 2-party digital transactions (cardholder and card issuer) to be doc-
umented by a 3rd party—in this case, the smart card system. Smart cards support one
or more of the following applications:
(1) Credit. Smart cards secure merchant transactions and reduce fraud. This al-
lows cardholders to have financial flexibility in the form of pre-approved trans-
border cash advances and loans, initiated at the point of sale and tied directly to the purchase of specific goods/services. The credit function rarely
requires a personal identification number.
(2) Debit. Smart cards are used to provide electronic direct debit at ATM ma-
chines and at specific points-of-sale. The debit function improves the card-
holder’s access to specific goods and services in locations where the card-
holder’s identity and credit history are unknown, and protects the cardholder’s
banking relationship by allowing a 3rd-party to directly debit the cardholder’s
financial institution for the purchase of specified goods/services at the point
of sale. It should be noted that such transactions in theory—but not always
in practice—require a personal identification number (PIN) as an additional
layer of security. This situation is currently being discussed in contract nego-
tiations and in the courts.
(3) Stored-value. A fixed value is initially encoded and subsequent purchases are
deducted (nominally, but not always, to a “zero” balance). This transaction
does not require access to the cardholder’s personal identity, so no PIN is
required. Two examples of single-function stored-value cards are telephone
cards and retail merchant gift cards. On magnetic stripe cards, this stored
value is placed on a “junk” stripe. On smart cards, this stored value is placed
in one or more electronic “purses.” Stored-value may be disposable (debit
“read-only”) or re-loadable (debit/credit “read-write”). While ownership of
the abandoned value on the card, or escheat, is disputable, it has been treated
as a source of substantial additional revenue in some settings, including the
Olympic games held in Atlanta [30,32,63,72].
(4) Information management. Card issuers provide these cards to individuals in
order to facilitate the portable storage and use of the cardholder’s personal
information, e.g., bank credit/debit accounts, insurance information, medical history, emergency contact information and travel documents (see http://www.iso.org/iso/en/commcentre/isobulletin/articles/2003/pdf/medicalcard03-06.pdf).
This information is used to verify the cardholder’s identity and authorize spec-
ified digital transactions [8,60,66,69,77–79,91].
(5) Loyalty/affinity. A variety of vendor incentives, recorded at the point of sale,
are tied to the purchase of goods/services. Incentives may include points,
credits, discounts and/or direct delivery of products/services. Since the card-
holder’s identity is known, purchasing patterns can be tracked and loyal cus-
two small trials were conducted in the US: (1) the First Bank Systems of Minneapo-
lis trial, involving ten (10) North Dakota farmers, and (2) a Department of Defense
(DOD) trial with a few hundred soldiers in Fort Lee, New Jersey. In 1983, Innovatron
awarded development rights for the USA, the UK and Japan to Smart Card Interna-
tional, a USA company. In 1986, MasterCard tested the technology in Columbia,
Maryland and Palm Beach, Florida.
In addition to smart card innovations themselves, universities have also played a
role as early adopters of smart card systems. In the USA, there were a number of
college and university campuses that upgraded from magnetic identification cards
to smart identification cards. Bill Norwood and eight colleagues left The Florida
State University (FSU) to found Cybermark, LLC, a joint venture of Sallie Mae,
the Battelle Memorial Institute, the Huntington Bank and the Florida State Univer-
sity. As a result of first-hand knowledge of the needs of the target customer group,
Cybermark had a first-mover advantage in the design and deployment of campus
smart card installations. In May 1999, for example, Cybermark added a credit card
option to the card (http://www.autm.net/pubs/survey/1999/cybermark.html). At one
point, some sixty people (45 in Tallahassee, Florida) processed transactions for over
700,000 issued cards. The number of cards doubled when SchlumbergerSema asked
Cybermark to service its existing campus card installations [17,20,23]. Table I pro-
vides a summary timeline of the early history of smart cards.
By 2002, there were still fewer than 35 (mainly European) companies in control
of 95% of the smart card market. At one point, three of Europe’s top ten high tech-
nology companies had smart card product lines, although most of these products
TABLE I
Smart Card Timeline
Year Event
1968 Two German inventors, Jürgen Dethloff and Helmut Gröttrup, patent the idea of combining
microchips with plastic cardsa
1970 A Japanese inventor, Kunitaka Arimura receives Japanese (only) patent protectionb
1974 French journalist, Roland Moreno, receives patent in Franceb
1976 French Licenses awarded to Bull (France) as result of DGT initiativeb
1979 Schlumberger (France) and Philips (Netherlands) receive first Innovatron licensesc
1980 GIE sponsors first smart card trials in 3 French citiesb
1982 First USA trials held in North Dakota and New Jerseyb
1996 First USA university campus deploys smart cardsb
1998 Gemplus introduces contactless IC
a [42];
b [81];
c [3].
TABLE II
Smart Card Shipments a by Region
Region 1999 2000 2001 2002 2003 2004 Totals Region (%)
still had limited functionality. The smart identification card market itself, however,
was not living up to its early promise. This changed as a result of the terrorist attacks on the
World Trade Center on September 11, 2001. Identity became inextricably linked with
protection of national economies, critical infrastructures, and citizens. At the same
time, quality had improved, costs of smart card systems had plummeted, and there
were many new applications that could take advantage of the chip—most notably,
biometrics. This situation resulted in a race to deploy smart card identification sys-
tems, especially for government, military and law enforcement personnel and transit
authorities [61,64,88]. In addition, there continues to be a growing awareness of tan-
gible and intangible benefits associated with the use of intelligent identification cards
[81,55]. For example, a simple Google® search can return thousands of hits on the
topic.
Today, the market is truly global. Table II shows recent and predicted worldwide
growth of smart card shipments by region [67]. Sadly, like so many other entrepre-
neurial ventures, Cybermark later failed for business reasons. However, it deserves
its place in history as an innovator in multi-application smart card systems. Today,
many companies have entered this market, and there are hundreds of companies, sev-
eral organizations and specialized publications that demonstrate the increasing im-
portance of smart card systems around the world (http://www.smartcardalliance.org;
http://www.eurosmart.com).
(1) Contact. Contact may be direct, in proximity to, or in the vicinity of the reader,
depending on the type of card and the capability of the reader/terminal the
cardholder wishes to use. In the case of a contact smart card, the card must be
physically inserted into a reader/terminal and remain in contact with it during
the session. In the case of contact smart cards, the reader/terminal verifies
that the card is properly situated and that the power supplies of both card and
reader are compatible. It then supplies the card with the power necessary to
complete the transaction through the chip’s designated contact point (Cn). In
the case of a contactless smart card, this requirement is eliminated.
(2) Card validation. The card must be validated in order to establish a session.
Some reader/terminals have the ability to retain those cards found on a “hot
list” of unauthorized cards and send a notice of attempted use to the card
issuer. They may also be able to invalidate future use of a card. If the card is
valid, however, the system captures an identification number to establish an
audit trail and process the transaction using the authorized “value” for that
transaction (for example, removing cash from an electronic “purse”) that is
stored on the chip. This activity is handled by a Secure Access Module (SAM)
on a chip housed inside the reader/terminal. The SAM has the electronic keys
necessary to establish communication with the defined operating system (OS)
for the associated reader/terminal. Each OS is associated with a specified card
platform, and the card must be compatible with that OS if it is to communicate
with that particular reader/terminal and engage in a transaction.
(3) Establishing communication. A Reset command is then issued to establish
communication between the card and the reader/terminal. Clock speed is es-
tablished to control the session. In the case of both a contact and a contactless
smart card, the reader/terminal obtains the required data from the card and
initiates the requested transaction.
(4) Transaction. Data required for the session is exchanged between the card
and the reader/terminal through the defined Input/Output contact point(s).
A record of the transaction is stored on both card and reader.
(5) Session termination. For contact smart cards, the contacts are set to a stable
level, the supply voltage is removed and the card is then ejected from the
terminal. For contactless cards, a termination protocol ends the session and
resets the reader. A schematic sketch of this session flow follows.
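To make the flow of such a session concrete, the sketch below walks one electronic-purse debit through the five steps as a minimal, runnable model. The data structures and field names are hypothetical stand-ins chosen for illustration; real reader/terminal protocols (and the SAM key exchange) are considerably richer.

```python
# A minimal, runnable sketch of the five-step contact-card session described above.
# All field names and values are hypothetical stand-ins, not a vendor API.

HOT_LIST = {"CARD-0002"}   # cards reported lost or stolen (hypothetical IDs)

def run_session(card, reader, amount):
    """Walk one electronic-purse debit through steps (1)-(5)."""
    # (1) Contact: the card must be seated and power-compatible before power is applied.
    if not (card["seated"] and card["voltage"] in reader["supported_voltages"]):
        return "abort: contact/power check failed"

    # (2) Card validation: hot-listed cards are rejected (and could be retained);
    #     otherwise the card ID is captured for the audit trail.
    if card["id"] in HOT_LIST:
        return "abort: card is hot-listed"
    reader["audit_trail"].append(card["id"])

    # (3) Establishing communication: after the Reset, fix a session clock speed.
    session_clock = min(card["max_clock_mhz"], reader["max_clock_mhz"])

    # (4) Transaction: debit the purse; both card and reader record the exchange.
    if card["purse"] < amount:
        return "abort: insufficient stored value"
    card["purse"] -= amount
    record = {"card": card["id"], "amount": amount, "clock_mhz": session_clock}
    reader["records"].append(record)
    card["records"].append(record)

    # (5) Session termination: stabilize contacts, remove power, eject the card.
    return "ok: card ejected"

card = {"id": "CARD-0001", "seated": True, "voltage": 5,
        "max_clock_mhz": 5, "purse": 20.0, "records": []}
reader = {"supported_voltages": {3, 5}, "max_clock_mhz": 4,
          "audit_trail": [], "records": []}
print(run_session(card, reader, 7.5))   # ok: card ejected
```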
still used in most digital financial transactions at this time. This means that logical
and physical access can be controlled from the same card (see Fig. 3).
Card transactions involve two or more parties (refer to Fig. 2 in Section 2 of this
paper); for example, cardholder + cardholder’s financial institution or [cardholder +
cardholder’s financial institution] ↔ [merchant + merchant’s financial institution]
[21]. The smart card is an improvement over simple magnetic stripe technology in
that it is able to (1) store and directly distribute containerized data that can represent
more than one affiliated institution; (2) facilitate more than one type of authorized
use; (3) carry substantially larger volumes of data on-board; and (4) process
transactions at significantly higher speeds.
Today’s smart cards may be either “contact” or “contactless,” but all smart cards
have on-board processing capabilities that require an integrated circuit (IC) chip that
is physically located on the card. Applications may be split between cards and readers
or off-loaded onto the card. Applets and small databases are now being designed to
fit on smart cards.
that were developed prior to the advent of smartchip cards. Currently, international
standards call for an “ID-1” plastic card base. Basic dimensions (not to scale) of a
typical smart card are shown in Fig. 4.
The earliest smart cards were a laminated plastic sandwich. That is, two layers
of plastic were bonded together. The bottom layer was solid and the top layer had a
section cut out for the chip, which was then inserted and wired to the card. One ex-
ample, a side cut-away, is shown in Fig. 5. This technology did not provide adequate
physical protection from cardholder’s “hip pocket” use and abuse. The chip tended
to pop out and the card layers tended to peel and separate. As a result, the data on
the card could not be read.
When a card could not be read, the eCash (stored value) on the chip could not
be verified through an audit trail, so cardholders were victimized twice—first with a
faulty card, and then with loss of the cash (stored value) recorded on the chip. This
situation upset cardholders, the constituency of early adopters in the USA, as most
were students, far from home and on restricted incomes.
In addition, cards and card applications were usually provided by a complex mix
of third parties, so problems were significantly more complicated to resolve than had
been the case with a 2-party identification card. As a result, card issuers, the customer
base, were also unhappy. They had paid significantly more for this technology, yet it
damaged their relationship with their primary user populations. Solutions needed to
be found, so manufacturers rapidly looked for ways to create a durable plastic card
base.
One type of material, ABS plastic, or acrylonitrile butadiene styrene, was primarily used
in injection molding. Another plastic, PVC, or polyvinyl chloride, was also tried. It
should be noted that there are several factors that impact the cost and quality of
card printers: printing speed, duplexing (the ability to print both sides of the card
in a single pass), encoding, and networking. Manufacturers also tried various print-
ing technologies. They learned that one of the better ones is Dye Diffusion Thermal
Transfer, or D2T2. In the D2T2 process, heat is used to transfer dye from a thin plas-
tic “carrier” ribbon to the surface of the card. ABS plastic was soft and it did not
work well with D2T2 printing technology. However, pure PVC was not much better.
When pure PVC was used, layers separated and peeled and dyes did not adhere to the
surface. Eventually, manufacturers found that they could increase the number of lay-
ers (to, for example, seven) and use thin layers of video grade PVC. To create video
grade PVC, certain polymers/polyesters (polyethylene terephthalates: PETG, PETP,
and PET) are added. This type of plastic works very well [22]. Manufacturers can
also choose from a half dozen processes to cement these layers together.
Smart cards often contain a video image of the cardholder. Digital videography
has replaced the tiresome processes of taking pictures, cutting them to fit, positioning
them on a card and then laminating it all together. Images are captured using RGB
(the best and costliest), S-VHS and NTSC video signals. Images are compressed and
stored in formats such as JPEG. Image quality is determined by the size of the orig-
inal image and the compression ratio. In the past decade, card production, products
and processes have all improved. At the same time, costs have dropped dramatically.
As a result, card quality is not the problem that it once was, but customers do need
to become informed about their options. A number of publications, associations and
conferences are now available.
the ‘top’ layer, the electrical components, of an embedded smart card processor chip.
Some chips may not include all possible memory types, and the additional nonvolatile
memory type (NVM) is not represented. Security is increased and card size is mini-
mized through the combining of all of the depicted elements into one integrated chip
[51].
In the case of contact cards, the IC chip must make physical, electrical contact
with a card reader in order to complete the specific transaction(s) for which the chip
is programmed. This is because the chip’s source of electrical power is in the reader
and not on the card. The contactless smart card need not make direct physical contact
with the card reader; instead, it draws power from the reader's electromagnetic field
through an embedded antenna and exchanges data with the reader by radio frequency
at varying distances.
Recent innovations in contactless smart cards appear to target increasing the on-
board memory to absorb additional applications [7,10] and the mini databases being
designed to fit on smart cards. One recent consideration with promise is the notion
of mobile cookies [16] that allow cardholders and/or card issuers portable, personal
access histories that are independent of the computer from which the access was
requested.
TABLE IIIA
ISO/IEC Standards a (Date as of Latest Full/Partial Revision)
a ISO 4909:1987 sets forth the standard for data content for this track. Two related standards, not discussed in this
paper, address optical memory (ISO/IEC 11693 and ISO 11694).
b Amended 2003.
TABLE IIIB
Contactless Cards
• http://www.cyberd.co.uk/support/technotes/isocards.htm
• http://www.incits.org/scopes/590_1.htm
• http://www.blackmarket-press.net/info/plastic/magstripe/Magstripe_Index.htm
• http://www.cardtest.com/specs.html
• http://www.javacard.org/others/sc_spec.htm
ISO 7810. The earliest international standard for IC cards, this standard estab-
lished the physical location of the chip on the card (see Fig. 4, above). As previously
explained, the stipulation that dictated the chip’s actual physical placement on the
plastic card base was a result of the demand by financial institutions for maximum
backward compatibility with existing magnetic stripe systems, as well as the desire
to provide maximum protection for the chip (i.e., if the card was bent, for example)
[33].
ISO 7811. This standard specifies card data formats (for example, Farrington 7B
as the specified font). Specific details in the standards address embossing, as well as
the location and data formats of the two read-only tracks (I and II) and the read-write
track (III). For more on magnetic stripes, see Section 5 of this paper.
ISO 7816. The various sections of this standard describe physical characteris-
tics of the smart card. Figure 8 is an example of the contact configuration of an IC
chip (refer to Table IV). Part 1 covers the physical characteristics of the smart card.
Part 3 specifies electrical signals. Part 4 defines, in part, the structure of stored files.
Part 5 covers High-level application communication protocols. Part 2, which covers
electrical contacts, is described below.
• Part 2. Location and size of the electronic contacts on the smart card. This
standard specifies six (6) contact points, although some chips have more. Each
contact (designated Cn below) has its own defined function:
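The contact-by-contact assignments appear in Fig. 8 rather than in the text. As a rough reference only, the mapping below follows the assignments commonly published for ISO 7816-2; it is an assumption about the figure's content, not a reproduction of it (C4 and C8 are frequently reserved, which is consistent with the six mandatory contacts mentioned above).

```python
# Commonly published ISO 7816-2 contact assignments; a reference sketch only,
# not a reproduction of Fig. 8. C4 and C8 are reserved on many cards.
ISO_7816_2_CONTACTS = {
    "C1": "VCC - supply voltage from the reader",
    "C2": "RST - reset signal",
    "C3": "CLK - clock supplied by the reader",
    "C4": "RFU - reserved for future use",
    "C5": "GND - ground",
    "C6": "VPP - programming voltage (largely unused on modern cards)",
    "C7": "I/O - serial data input/output",
    "C8": "RFU - reserved for future use",
}

for contact, role in ISO_7816_2_CONTACTS.items():
    print(contact, role)
```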
FIG. 8. IC chip contacts.
ISO/IEC 10536 Close Coupling Cards. The first standard for contact-
less cards required that cards either be inserted into a reader or placed in a very
precise location on the card reader’s surface. The limited distance and the high level
of accuracy required for a “good read” discouraged the use of contactless cards in
controlled environments. This standard has been abandoned as a result of improve-
ments in contactless card technologies.
TABLE IV
A Comparison of Parameter Values for Contactless Cards
Proprietary interfaces
Source: [3].
ISO/IEC 15693 Vicinity Cards. The vicinity card has three modes with as-
sociated ranges of operation: (1) read mode (70 cm); (2) authenticate mode (50 cm);
and (3) write mode (35 cm). There are three separate parts to this standard:
• Part 1. Physical Characteristics—IS 7:2004 establishes the physical card size
at the ID-1 size (85.6 mm × 54.0 mm × 0.76 mm). This is the same size as a
bank credit card.
• Part 2. Air Interface and Initialization—IS 11:2001 defines frequency modula-
tion, data coding and data rate values for both reader-to-card and card-to-reader
communication.
• Part 3. Anticollision and Transmission Protocol—IS 6:2001 defines the proto-
col, command set and other parameters required to initialize communication
between a reader and a card. It also defines the anticollision parameters, which
facilitate the selection and appropriate use of a single card when multiple cards
enter the reader’s magnetic field.
(3) Sue then encrypts and sends a response to Jean, using Jean's public key.
A "hash," or digest, of the response is also computed and signed with Sue's
private key. Jean is able to decrypt and read the response document from Sue.
If the hash Jean recomputes matches the hash Sue signed, then Jean can assume
that Sue is really Sue (and not Sarah pretending to be Sue).
The fact that a third party can verify the transaction prevents repudiation of the transac-
tion by either of the two parties that participated.
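The identity check in step (3) comes down to comparing message digests after an asymmetric operation. The sketch below illustrates only that hash-comparison step, using textbook RSA with deliberately tiny, insecure parameters; real smart card systems use far larger keys and standardized padding, and the message text here is made up.

```python
import hashlib

# Toy RSA signing/verification to illustrate the hash comparison described above.
# The parameters are the classic tiny textbook example and are utterly insecure.
p, q = 61, 53
n = p * q                  # modulus 3233
e, d = 17, 2753            # public/private exponents: e*d = 1 mod (p-1)*(q-1)

def digest(message: bytes) -> int:
    """SHA-256 digest reduced modulo n so it fits the toy key size."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

# "Sue" signs the digest of her response with her private exponent d.
response = b"Yes, the shipment left on Tuesday."
signature = pow(digest(response), d, n)

# "Jean" recomputes the digest and compares it with the value recovered using e.
recovered = pow(signature, e, n)
print("hashes match:", recovered == digest(response))   # True: signed by Sue's key
```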
The smart card often includes two other technologies: (1) a magnetic stripe that
facilitates backward compatibility with financial (and other) transactions; and (2) the
barcode, that facilitates contactless access for purposes such as inventory control.
These two technologies are still widely used in a range of settings, they are relatively
inexpensive and they are usually less complex to administer than smart cards. They
are often incorporated into applications on the smart card itself. For this reason, a
brief discussion of two of these associated access technologies is in order.
• BCD Data Format. Four data bits (five zeros and ones in all, counting the check
bit) are used to create a 16-character set (2 × 2 × 2 × 2 = 16) [12] that consists of
the ten numeric digits (0–9), 3 characters for framing and 3 characters for control.
The fifth bit is treated as a parity check.
• ANSI/ISO Alpha Data Format. This 7-bit data format generates a sixty-four
character set (2 × 2 × 2 × 2 × 2 × 2 = 64), including the ten numeric characters,
all 26 letters of the alphabet, 3 framing characters and 18 control characters,
using 6 bits for character generation and 1 bit as a parity check.
In both formats, there are at least 3 control characters. The Start Sentinel (SS) signals
the start of meaningful data. This gives the card reader a chance to synchronize and
decode the transmitted data. The End Sentinel (ES) control character is followed by
the Longitudinal Redundancy Check (LRC) character that works as an error check
for the whole line of data.
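A short sketch of the 5-bit BCD scheme just described (4 data bits plus a parity bit per character, framed by the Start Sentinel and End Sentinel and closed by an LRC) appears below. The sentinel values and the choice of odd parity follow common descriptions of ISO 7811 Track 2 and should be read as assumptions for illustration, not as a complete encoder.

```python
# Sketch of 5-bit BCD track encoding: 4 data bits plus an odd-parity bit per
# character, framed by Start/End Sentinels and closed by an LRC character.
# Sentinel values and odd parity are assumptions based on common Track 2 notes.

START_SENTINEL, END_SENTINEL = 0b1011, 0b1111   # ';' and '?' in the Track 2 set

def with_parity(nibble: int) -> str:
    """Return the 4 data bits plus an odd-parity bit as a 5-character bit string."""
    parity = (bin(nibble).count("1") + 1) % 2    # make the total number of 1s odd
    return f"{nibble:04b}{parity}"

def encode_track(digits: str) -> str:
    values = [START_SENTINEL] + [int(d) for d in digits] + [END_SENTINEL]
    lrc = 0
    for v in values:                             # LRC: XOR over the data nibbles
        lrc ^= v
    values.append(lrc)
    return " ".join(with_parity(v) for v in values)

print(encode_track("1234"))
```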
5.1.2 Barcodes
In 1948, Dr. Joseph Woodland, then a lecturer in mechanical engineering at the
Drexel Institute of Technology (now Drexel University), became interested in the
need for supermarkets to track inventory and automate the checkout process. He
found that the variance in polychromatic systems was too great, but Morse code
lacked enough elements to support the necessary level of detail. By extending those
dots and dashes into thin and thick lines, he and Bernard Silver developed a system
for decoding the lines that called for an early equivalent of laser light. They
received US patent 2,612,994 in 1952 for this "Classifying Apparatus and Method."
In 1973, Woodland’s invention became the basis of the Universal Product Code, or
UPC, an example of which is shown in Fig. 12 (see also Fig. 13).
Today, barcodes are assigned to products and used to link products to inventory
and sales management systems in every sector of the economy. According to data
compiled by the Uniform Code Council, UPC codes serve over 600,000 manufac-
turing companies and are scanned 5 billion times a day, but this is less than half of
today’s bar code technology. In libraries, for example, barcodes are generated and
used as unique identifiers for both individual library patrons and individual items
such as circulating books. ISO numbers (discussed in Section 5.1) are converted into
barcodes. Today, mini-databases of various types are being
designed to be carried on the smart card itself. ISO numbers and UPC codes are ex-
amples of the types of data stored in these databases (http://www.uc-council.org and
http://www.drexel.edu/coe/news/pubs/coepuzzleranswerinsummer2003.html).
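The text does not spell out how a UPC number protects itself against misreads, but the familiar UPC-A check-digit rule is easy to sketch: triple the sum of the digits in the odd positions, add the digits in the even positions, and choose the final digit that brings the total to a multiple of ten. The sample digits below are only illustrative.

```python
# UPC-A check digit: 3 * (sum of odd-position digits) + (sum of even-position digits),
# then the digit that rounds the total up to a multiple of 10.

def upc_check_digit(first_11_digits: str) -> int:
    odd_sum = sum(int(d) for d in first_11_digits[0::2])    # positions 1, 3, 5, ...
    even_sum = sum(int(d) for d in first_11_digits[1::2])   # positions 2, 4, 6, ...
    return (10 - (3 * odd_sum + even_sum) % 10) % 10

print(upc_check_digit("03600029145"))   # 2, completing the 12-digit code 036000291452
```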
Prior to the development of the Java Card platform, APDUs (Application Protocol Data
Units) were awkward, idiosyncratic and took a long time to write. This made smart card
applications proprietary, expensive, and very slow to develop. The stated purpose of Sun
Microsystems' Java Card platform is to create an open programming architecture so that
applications can be written once to run on all cards. This would make it possible to
program a smart card application "in a day." To accomplish this goal, Sun posts
documentation and provides training classes that include constructing and parsing
APDUs (see http://java.sun.com).
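A command APDU is simply a short byte string with a fixed header. The sketch below assembles a SELECT-by-name command in the ISO 7816-4 layout (CLA, INS, P1, P2, Lc, data, Le); the application identifier used here is a placeholder, not the AID of any real application.

```python
# Build an ISO 7816-4 command APDU: CLA INS P1 P2 [Lc data] [Le].
# The AID below is a placeholder, not the identifier of any real application.

def build_select_apdu(aid: bytes) -> bytes:
    header = bytes([
        0x00,   # CLA: interindustry class
        0xA4,   # INS: SELECT
        0x04,   # P1 : select by DF name (AID)
        0x00,   # P2 : first or only occurrence
    ])
    return header + bytes([len(aid)]) + aid + bytes([0x00])   # Lc, data, Le

apdu = build_select_apdu(bytes.fromhex("A000000000"))
print(apdu.hex(" "))   # 00 a4 04 00 05 a0 00 00 00 00 00
```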
For example, the card issuer may be slow to identify the new status, slow to
key the status change into the system, or slow to communicate that change
to affiliated networks or to physically transfer the status to offline card read-
ers. Until the change propagates, there is a window of opportunity for abuse of
hot-listed smart cards. In the “old days” of contact smart cards, the card had
to be inserted into a slot. Card readers could be programmed to retain the
card; however, this was an unpopular option because cardholders were vic-
timized both when they inadvertently abandoned their cards and when the
card readers “ate” damaged, authorized or re-authorized cards. Contactless
cards must be “de-listed” by readers that have specific read-write function-
alities. It should be pointed out that having a card in a cardholder’s posses-
sion leads to assumptions that it is valid. In addition, de-listing and retain-
ing a card with multiple functionalities not directly associated with the orig-
inal card issuer’s primary reason for issuing the card will need to be ad-
dressed. Today, most readers do not have the capability to retain de-authorized
cards.
• Cryptography and cryptanalysis. The strongest encryption methods available
in the past were "1-time pads," randomly generated keys as long as the message
itself (a minimal sketch of the idea follows this bullet). For the message to be
decrypted, the sender and the recipient had to have access to the same "1-time
pad." Today's digital equivalent is public-key cryptography; such systems use
both a public key and a hidden (secret) private key. The security problem with
shared secret-key (symmetric) systems rests on the need to trade the key, often
over unsecured lines; for this reason, asymmetric methods are generally used
for sensitive transactions. The security problems associated with
asymmetric encryption systems are caused by conflicts between government
and law enforcement (the need to know what is happening in order to pre-
vent economic crimes such as money laundering (used to fund terrorism and
other violent crimes)) and companies (the need to protect sensitive and pro-
prietary data from global competitors who engage in economic and electronic
espionage) [29]. Humans are eventually able to break most codes designed by
humans, so we can expect that sufficient computing power will eventually be
used to “break” most technology-driven encryption schemes. It is mainly a mat-
ter of resource allocation. For this reason, the security issues with encryption are
aligned with public perception and the degree of acceptable economic/societal
risk involved.
For example, Bellcore researchers threw the smart card market into disarray
in 1995 when they announced that they had found a (theoretical) way to vio-
late the security of smart card encryption. They claimed that criminals could
heat the smart card (for example, in a microwave oven) or radiate it, thus trick-
ing the smart card into making computational mistakes. By comparing actual
with anticipated values, criminals could use these mistakes to identify the use-
ful patterns on the smart card that provide clues to the encryption keys and
hidden information. They called this method “Cryptanalysis in the Presence
of Hardware Faults” [6]. While subsequent work has not found a way to ac-
tualize this theoretical model, such publicity certainly should trigger industry
concerns.
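The one-time-pad principle mentioned at the start of the cryptography item above is easy to demonstrate: XOR the message with a truly random key of the same length, and the same operation recovers it. A minimal sketch (secure only if the pad is truly random, used once, and exchanged out of band):

```python
import os

# One-time pad: XOR with a random key as long as the message; the same XOR decrypts.

message = b"debit 25.00 from purse"
pad = os.urandom(len(message))                      # the "1-time pad"

ciphertext = bytes(m ^ k for m, k in zip(message, pad))
recovered = bytes(c ^ k for c, k in zip(ciphertext, pad))

print(recovered == message)   # True
```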
• Data destruction. A number of technologies could be used to invade sys-
tems and damage, destroy or steal data. These range from relatively low-
technology applications such as visual and/or keystroke surveillance of card-
holders engaged in smart card transactions to far more sophisticated tech-
niques from Van Eck Phreaking to degaussing. Another threat deals with the
common "cookie." Cookies were introduced by Netscape in 1994 to solve the state
retention problem by introducing new headers to be carried over HTTP [16].
Cookies are stored on the client side (the user's com-
puter), which allows a history of use to be developed and maintained. Cook-
ies have been used to cross over and acquire personal data (a form of elec-
tronic trespassing) without the cardholder’s (or card issuer’s) knowledge. Re-
search is underway to enable servers to track users' information-seeking be-
haviors once they leave the server’s site. There are law enforcement, as well
as marketing, benefits involved in finding ways to facilitate such link and pat-
tern analysis. However, while the card belongs to the issuer, the data has been
considered the property of the cardholder and the right to construct privacy
fences around inquiries is considered a fundamental tenet of a democratic soci-
ety.
• Cross validation. Identification verification is at the heart of the smart card’s
potential worth. Credit scoring systems use multiple databases, and score for
data quality, to verify identity and data association. For this reason, smart card
systems that rely on a single ID verification method, even a biometric one, are
potentially dangerous. It is much easier to erase or change data in a single data-
base than to do it in several dozen databases, especially where ownership of
the databases involves multiple encryption schemes, various data sets and a
number of organizations with competing agendas. While it is potentially ex-
pensive to include multiple biometric data sets, it is a false economy to as-
sume that a single data set is (or will always be) sufficient. The best solution
at this time is personal, portable biometrics where the cardholder’s personal
characteristics are compared to cross-validated data sets on the card and at re-
mote sites at the point of use. This is extremely expensive and resource inten-
sive.
7. Future Developments
At this time, there are three primary concerns that limit widespread market accep-
tance of the sophisticated features of smart cards: (1) available applications; (2) de-
ployment costs; and (3) public concerns regarding such issues as data security (refer
to the discussion on data security, above) and personal privacy.
• Applications. Applications require processing capability and memory capacity,
both of which have been improved by the move from Assembly to C programming
languages, use of Java, market preference for global standards and open archi-
tecture, and a growing interest in developing mini-databases.
• Deployment costs. While the relatively slow migration from magnetic stripe
cards to smart cards continues to be based on financial factors (e.g., financial
institutions with large investments in magnetic card systems are slow to invest
in the new technology), the migration to more secure transaction processing
systems also reflects regulatory differences. In countries where online transac-
tion costs are low, online, real-time transactions are commonplace, and there is
little economic incentive to migrate to more secured transactions. In countries
where the cost of online transactions is high, off-line (batch) processing results
in higher rates of card-related economic fraud, and there is more willingness to
migrate to the more secure smart card technology.
The “true” cost of any technology, however, includes both tangible costs and
intangible benefits. Recent improvements in managerial and cost accounting
practices enable decision makers to gain better insights into the return on invest-
ment of more secure access controls. Particularly after the attacks of Septem-
ber 11, 2001, smart card identification systems were investigated and there have
been many new installations of these systems. Returns on investment do not
always appear where they are anticipated. After implementing smart card tech-
nology, universities in the US noticed sharp reductions in armed robberies and
vandalism of vending machines. Financial assistance to students was processed
more quickly and involved fewer staff. American Express issued the Blue
Card® , even though sophisticated applications were not yet in place. The ex-
pectation was that customers would “upgrade” their cards. The company found
that simply having this technology available attracted many new customers [24].
• Personal privacy. Cardholder privacy becomes an issue where new develop-
ments allow applications to share PINs and user access histories become at-
tached to individuals rather than specific terminals (mobile cookies). Economic
conditions have not recently favored such investments. The slow rate of adop-
tion is partly a result of psychological obstacles [34]. Research on National
ID card program success/failure [83], for example, found that turf battles be-
Smart card-based identification has the potential to avoid both the fraud
associated with the issue of driving licenses today and to meet the need for
ID verification without offending an individual’s personal beliefs. For exam-
ple, verification of an individual presenting a typical US driver's license is
based on a photograph of the cardholder’s face, but some individuals keep
their faces covered in public. Therefore, photographs show only cloth head
coverings. In addition, the licenses are valid for such extended periods of time
that physical characteristics can change. High-tech, smart card-based driver’s
licenses with personal biometric data could use additional or alternate cri-
teria to verify individual identities. In addition, such licenses could include
relevant data, such as driving records or unpaid traffic fines [56,90].
– Information storage. There is a need to store ever-growing volumes of data,
particularly personal demographic information [21]. The US healthcare in-
dustry is a prime example, where the Federal government has mandated elec-
tronic social benefits transfer and HIPAA compliance. In addition, there is a
push to streamline business operations, including the automation of primar-
ily clerical functions (such as fast admission for emergency room patients
and linking unknown patients to their medical records in order to provide ap-
propriate types and levels of care) [31]. For this reason, medical and health-
care information management systems are being upgraded. At the same time,
there is a need to identify and provide services that attract new sources of
revenue that are not tied to reimbursement schedules [83]. Among the po-
tential applications of smart card technology in healthcare are: (1) automated
hospital admissions; (2) transfer of medical records, drug prescriptions and
insurance authorizations; and (3) technology information, such as individ-
ual kidney dialysis equipment settings [31]. Another new source of revenue
is vending authorization (bedside delivery of upscale meals for those on a
regular diet, for example). Again, there are significant social, political, le-
gal/regulatory and economic issues that must be considered, which are be-
yond the scope of this discussion [76,84].
– Transaction processing. Smart card readers support both traditional and
newer “in-home” ATM transactions and home shopping [80]. As we have
seen with the Euro, currencies are merely points on a scale. The magnetic
card has already proven to be a durable technology for international ATM
transactions. Smart card technology could accelerate the decline of currency
exchange and travelers’ checks because the cards can carry currency in any
foreign denomination and enable the cardholder to segregate both discre-
tionary spending and job-related travel [21,38].
The newer contactless smart cards show great promise for improving the
speed of access, e.g., transit passes and access to restricted facilities, such
as military bases and clean rooms. The “smart” nature of the card supports
additional data capture for audit trail and security purposes. Today, smart
cards are integrated into a wide range of telecommunications technologies.
For example, GSM telephones can serve as portable ATM machines as well
as locator beacons [33]. GSM telephones today use smart SIM (Subscriber
Identity Modules) cards that can be encrypted [46,31]. There is a growing
interest in eGovernment services. One possibility is the use of smart cards
for eVoting. There is certainly a possibility that voter fraud could be reduced
by linking the voter’s biometrics with the casting of a ballot [43].
While it was once predicted that multifunction smart cards would dominate the smart
card market in the near future [80], shipment data indicates the multifunction card
has arrived. Standards appear to be relatively stable at the present time [3], but there
are additional external considerations, such as competition for application space on
the card and conflicts over transaction processing revenues. Several of the original
companies involved in smart card innovation have since left all or part of the smart
card industry. As a whole, however, the industry is healthy with many new niche
markets and competitors. Most of the companies currently engaged in the develop-
ment and sale of smart card applications are quickly developing expertise and/or the
resources required [25,27,36,37,70,71,83].
Clearly, there continue to be “software design, economics, liability and privacy
concerns, consumer acceptance and . . . other political and personal issues” [31].
There continue to be vast implications that result from the contribution of smart cards
in the “integration of commercial transactions, data warehousing and data mining”
[81]. Even so, smart card technologies are being used to improve the security of many
transactions. As a result, smart card applications play an increasingly significant role
in the nature and direction of information exchange [34,95–97].
Glossary of Terms
American National Standards Institute (ANSI). Promotes national commerce
in the form of interoperability, facilitated through voluntary consensus standards.
EFT—Electronic Funds Transfer. Any debit/credit transaction that is processed
by electronic means. Through Reg E, the Federal Reserve implemented the
Electronic Funds Transfer Act of 1978.
EMV Standard. The Europay/MasterCard/Visa standard for contact smart
cards, beginning with placement of the chip on the card (upper left) and specifics
of the ABA stripe.
Contact. Smartchip cards that require the smartchip itself to be placed in phys-
ical contact with the reader device in a specific fashion. This contact is required
for the authentication/verification/authorization process.
Farrington 7B. The type face specified in ISO standard 7811 for embossed
characters on ID cards.
Firmware. Software written into read-only memory that specifies the operation of
a hardware component and cannot be altered without changes to the hardware
configuration.
Hot card. An issued card that is no longer considered legitimate in the system,
i.e., it has been reported lost or stolen, but has not been returned to the issuer.
RF card. An ID card in which the data from the card is transmitted to the reader via
radio frequency emissions. RF cards are a common variety of proximity cards in
that the card need not make physical contact with the reader.
Source: [13,14,85,86]
REFERENCES
[1] “A brief history of smart card technology: who, what, where, when and why”, Campus
ID Report 2 (1) (January 1997) 1, 3–4.
[2] “A reference guide to card-related ISO standards and committees”, Campus ID Re-
port 1 (5) (July 1996).
[3] Avisian, Inc., Contactless Smart Card Technology for Physical Access Control, 1 April
2002. An Avisian, Inc. Report, Unpublished White Paper.
[4] “Barcode basics: The ins and outs of these little black and white symbols”, Campus ID
Report 2 (3) (May 1996) 6–7.
[5] “Bar code basics”, Campus ID Report 2 (3) (March/April 1997) 6–7.
[6] “Bellcore’s shaky smart card ‘threat model’ shakes up card industry”, Campus ID Re-
port 1 (3) (May 1996) 6.
[7] Benini L., et al., “Energy-efficient data scrambling on memory processor interfaces”, in:
ISLPED’03, Seoul, South Korea, 25–27 August 2003, pp. 26–29.
[8] Berinato S., “Smart cards: The intelligent way to security”, Network Computing 9 (9)
(15 May 1998) 168.
[9] Block V., “Looking beyond chips, Motorola plans full range of smart card products”,
American Banker 62 (57) (25 March 1997) 14.
[10] Bolchini C., et al., “Logical and physical design issues for smart card databases”, ACM
Trans. Inform. Systems 21 (3) (July 2003) 254–285.
[11] Briney A., “A smart card for everyone?”, Information Security (March 2002).
[12] “Campus ID cards in the library: The land of the sacred bar code”, Campus ID Re-
port 1 (6) (May 1996) 8.
[13] “Card industry lexicon”, Campus ID Report 2 (3) (May 1996) 8.
[14] “Card industry lexicon: Understanding your campus card industry lingo”, Campus ID
Report 1 (4) (June 1996) 7.
[15] “Card Europe UK—background paper. Card Europe: The association for smart card and
related industries” (online), http://www.cardeurope.demon.co.uk/rep1.htm, 1994. Ac-
cessed 29 April 2000.
[16] Chan A.T., Mobile cookies management on a smart card. Unpublished paper in review,
2003.
[17] Clendening J., “EDS announces US$1 billion in global IT business”, PR Newswire
(3 May 2001).
[18] Costlow T., “Major players bet on new smart card standard”, Electronic Engineering
Times 965 (4 August 1997) 6.
[19] Craig B., “Resisting electronic payment systems: burning down the house?” Economic
Commentary (July 1999).
[20] Croghan L., “Chase, Citi put their money on smart card pilot program”, Crain’s New
York Business 13 (39) (29 September 1997) 1.
[21] Cross R., “Smart cards for the intelligent shopper”, Direct Marketing 58 (12) (April
1996) 30–34.
[22] “D2T2: Printing on plastic”, Campus ID Report 1 (4) (June 1996) 8–9.
[23] Davis D., “Chip cards battle bar codes for big ID project”, Card Technology (March
2002).
[24] Donoghue J.A., “White hats/black hats: A pre-screened group of passengers may make
it easier to concentrate security efforts on the rest, but efforts to construct a ‘trusted trav-
eler’ system are off to a slow, uncoordinated start”, Air Transport World 38 (3) (March
2002).
[25] Eedes J., “Growth companies. Leaders of the technology pack”, Financial Mail (26 Oc-
tober 2001).
[26] “Eelskin wallets and other misconceptions about magnetic stripes”, Campus ID Re-
port 1 (8) (September 1996) 1, 5.
[27] “Electronic services use set to explode”, Bank Systems & Technology 34 (8) (August
1997) 15.
[28] Elliot J., “The one-card trick—multi-application smart card e-commerce prototypes”,
Comput. Control Engrg. J. 10 (3) (June 1999) 121–128.
[29] “Encryption part II: Certificates and certificate authorities”, Campus ID Report 1 (6)
(August 1996) 1, 6.
[30] “Evaluating the VISA cash pilot in Atlanta: A campus view”, Campus ID Report 1 (6)
(August 1996) 1, 5.
[31] Fancher C.H., “Smart cards”, Scientific American (1 August 1996) (online), http://www.
sciam.com/0896issue/0896fancher.html. Accessed 1 April 2000.
[32] “FDIC examines stores value: Should it be treated as a deposit?” Campus ID Report 1 (7)
(September 1996) 1, 5.
[33] Fletcher P., “Europe holds a winning hand with smart cards”, Electronic Design 47 (1)
(11 January 1999) 106.
[34] Flohr U., “The smart card invasion”, Byte 23 (1) (January 1998) 76.
[35] Ganeson P., et al., “Analyzing and modeling encryption overhead for sensor network
nodes”, in: WSNA’03, San Diego, California, USA, 2003, pp. 151–159.
[36] Gjertsen L.A., “Insurers turn to stored value cards”, American Banker 6 (116) (2001).
[37] Goedert J., Health data management (February 2000).
[38] Gray D.F., “Euro spurs I-commerce uptake”, InfoWorld 21 (15) (12 April 1999) 70.
[39] Grossschadl J., “Architectural support for long integer modulo arithmetic on RISC-based
smart cards”, Internat. J. High Performance Comput. Appl. 17 (2) (Summer 2003) 135–
146.
[40] Guyon J., “Smart plastic”, Fortune 136 (7) (13 October 1997) 56.
[41] Hempel C., “National ID card stirs a world of debate”, Knight-Ridder Tribune Business
News (19 December 2001).
[42] Husemann D., “The smart card: don’t leave home without it”, IEEE Concurrency 7 (2)
(April–June 1999) 24–27.
[43] Jain A., Hong L., Pankanti S., “Biometric identification”, Communications of The
ACM 43 (2) (February 2000) 90–98.
[44] Kutler J., “Visa-promoted CEPS making inroads in Europe, Asia”, American
Banker 163 (226) (25 November 1998) 11.
[45] Kutler J., “Cell phone-smart card hookup eyed for US after winning over Europe”, Amer-
ican Banker 163 (212) (4 November 1998) 1.
[46] Kutler J., “Card giants using EMV as a stepping stone”, American Banker 164 (114) (16
June 1999) 11.
[47] Kutler J., “Java gets pats on back from card businesses in Belgium and France”, Ameri-
can Banker 164 (61) (31 March 1999) 16.
[48] Kutler J., “Visa regional operation adopts six-year plan for smart card conversion”,
American Banker 164 (109) (9 June 1999) 13.
[49] Kutler J., “A boost from Europe for card readers”, American Banker 164 (197) (13 Oc-
tober 1999) 18.
[50] Ladendorf K., "Americans wising up to 'smart card' technology", Knight-Ridder Tri-
bune Business News (1 October 2001).
[51] Leung A., “Smart cards seem a sure bet InfoWorld.com” (online), http://unix.idg.net/
crd_smart_69240.html, 8 March 1999. Accessed 29 April 2000.
[52] Lu C., dos Santos A.L.M., Pimental F.R., “Implementation of fast RSA key generation
on smart cards”, in: SAC 2002, Madrid, Spain, 2002, pp. 214–220.
[53] “Magnetic stripes: Track by track”, Campus ID Report 1 (1) (March 1996) 6–7.
[54] “Mag stripe stored value”, Campus ID Report 1 (10) (November 1996).
[55] Marcial G.G., “From cubic: The ID cards of tomorrow”, Business Week (19 November
2001).
[56] McGregor O., McCance O., "Use of chips smart?; Debate rages over licenses", Rich-
mond Times-Dispatch (19 February 2002) in Knight-Ridder Tribune Business News.
[57] Moore A., “Highly robust biometric smart card design”, in: IEEE Transactions on Con-
sumer Electronics, vol. 46, 2000.
[58] Morrison D.J., Quella J.A., “Pattern thinking: Cutting through the chaos”, Marketing
Management 8 (4) (Winter 1999) 16–22.
[59] Muller H., “Europe’s Hi-Tech edge”, Time (31 January 2000) 28–31.
[60] Nairn G., “Survey—FT-IT: Fragmentation hinders medical market growth”, Financial
Times Surveys Edition (18 April 2001).
[61] Anonymous, “New homeland security department aims for IT compatibility”, Newsbytes
(7 June 2002).
[62] Newman D., “PKI: Build, buy or bust?”, NetworkWorld (10 December 2001).
[63] “Olympic visa cash trial to be the highest profile test of a stored value card to date”,
Campus ID Report 1 (4) (June 1996) 11.
[64] Orr T.L., “FEMA responds with technology”, Government Computer News 20 (8) (16
April 2001).
[65] OSCIE, Open smart card infrastructure for Europe v 2; Volume 6: Contactless technol-
ogy; Part 1: White paper on requirements for the interoperability of contactless cards.
Issued by eESC TB 8 Contactless Smart Cards, March 2003.
[66] “PAR technology subsidiary awarded $5.1 million government contract”, Business Wire
(31 May 2001).
[67] Phillips A., “Poised to take off”, Electronic Engineering Times Hot Markets Special
Report (October 2000) 22.
[68] Priisalu J., “Frequently asked questions list, Estonian institute of cybernetics” (online),
http://www.ioc.ee/atsc/faq.html, 4 July 1995. Accessed 30 April 2000.
[69] Proffitt D., “Travelers wise up, use smart card”, Business Journal (Phoenix) 16 (23)
(5 April 1996) 29–30.
[70] Radice C., “Smart cards hype or solutions?”, Grocery Headquarters 68 (1) (January
2002).
[71] Redd L., “Improving your return on IT”, Health Forum (2 July 2002).
[72] “Regulation E and Campus Cards”, Campus ID Report 1 (2) (April 1996) 1, 5.
[73] Reid K., “Pass and pay technology”, National Petroleum News 92 (2) (February 2000)
32–38.
[74] “RFID cards: How proximity cards operate”, Campus ID Report 1 (5) (July 1996) 8.
[75] Rogers E.M., Diffusion of Innovations, third ed., Free Press, New York, 1983.
[76] Rogers A., “European Parliament gains ground in health”, Lancet 347 (9009) (27 April
1996) 1180.
[77] Sanchez-Reillo R., “Securing information and operations in a smart card through bio-
metrics”, IEEE (2000).
[78] Sanchez-Reillo R., “Smart card information and operations using biometrics”, IEEE
Aerospace and Electronic Systems Magazine (April 2001) 52–55.
[79] Sanchez J.S., “Disaster protection for financial data”, Financial Executive 17 (9) (De-
cember 2001).
[80] Schacklett M., “These business trends will shape the future of e-commerce”, Union Mag-
azine (January 2000) 14–15.
[81] Shelfer K.M., “Intersection of knowledge management and competitive intelligence:
Smart cards and electronic commerce”, in: Knowledge Management for the Information
Professional, Information Today, Medford, NJ, 1999.
[82] Shelfer K.M., Procaccino J.D., Communications of the ACM 45 (7) (July 2002) 83–88.
[83] Shelfer K.M., Procaccino J.D., National ID card programs: Requirements engineering,
Preliminary results, unpublished paper.
[84] Shelfer K.M., Procaccino J., Smart health cards: Generating new revenues from old ser-
vices, unpublished paper.
[85] “Smart cards 101: Different chips, different terms”, Campus ID Report 1 (2) (April 1996)
10–11.
[86] “Smart cards 102: Basic operations of an IC card”, Campus ID Report 1 (9) (November
1996) 5.
[87] Spurgeon B., “Big brother aside, smart ID cards are making global converts”, Interna-
tional Herald Tribune (16 November 2001).
[88] Thaddeus J., “Disaster strategy: Bring continuity from calamity”, Computerworld 35 (7)
(12 February 2001).
[89] “The ins and outs of encryption”, Campus ID Report 1 (5) (May 1996) 1, 4–5.
[90] Thibodeau P., “License bill could create IT headaches”, Computerworld 36 (17) (22
April 2002).
[91] Tillett S., “INS to issue digital green card”, Federal Computer Week (8 September 1997).
[92] “Time to get smart. (Smart cards and EMV)”, Cards International 14 (7 May 2003).
[93] “Understanding the Buckley amendment: Is your campus card violating privacy laws?”,
Campus ID Report 1 (2) (April 1996).
[94] “Understanding the 10-digit ISO number”, Campus ID Report 2 (2) (February 1997) 1,
5.
[95] Webb C.L., “Tech firms still waiting for floodgates to open”, Newsbytes (15 April 2002).
[96] Weinberg N., “Scare tactics”, Forbes (4 March 2002).
[97] Whitford M., “Unlocking the potential”, Hotel & Motel Management 214 (3) (15 Feb-
ruary 1999) 61.
Shotgun Sequence Assembly
MIHAI POP
The Institute for Genomic Research (TIGR)
Rockville, MD 20850
USA
mpop@tigr.org
Abstract
Shotgun sequencing is the most widely used technique for determining the DNA
sequence of organisms. It involves breaking up the DNA into many small pieces
that can be read by automated sequencing machines, then piecing together the
original genome using specialized software programs called assemblers. Due to
the large amounts of data being generated and to the complex structure of most
organisms’ genomes, successful assembly programs rely on sophisticated algo-
rithms based on knowledge from such diverse fields as statistics, graph theory,
computer science, and computer engineering. Throughout this chapter we will
describe the main computational challenges imposed by the shotgun sequencing
method, and survey the most widely used assembly algorithms.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
2. Shotgun Sequencing Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3. Assembly Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
3.1. Shortest Superstring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
3.2. Overlap-Layout-Consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
3.3. Sequencing by Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.4. Hierarchical Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
3.5. Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4. Assembly Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.1. Overlap Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.2. Error Correction and Repeat Separation . . . . . . . . . . . . . . . . . . . . . . 222
4.3. Repeat Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4. Consensus Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.5. Scaffolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.6. Assembly Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
5. Exotic Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
5.1. Polymorphism Identification and Haplotype Separation . . . . . . . . . . . . . 237
1. Introduction
In 1982 Fred Sanger developed a new technique called shotgun sequencing and
proved its worth by sequencing the complete genome of the bacteriophage Lambda
[1]. Sequencing technologies of the time were only able to "read" several hundred
nucleotides at a time; shotgun sequencing attempted to overcome this limitation by
breaking up the DNA at random. The resulting pieces were assembled together
based on the similarity between pieces derived from the same section of the original
DNA molecule. The large amount of data produced by shotgun sequencing made
it necessary to utilize computer programs to assist the assembly [2,3]. Despite con-
tinued improvements in sequencing technology and the development of specialized
assembly programs, it was unclear whether shotgun sequencing could be used to se-
quence genomes larger than those of viruses (typically 5000–100,000 nucleotides).
For larger genomes it was thought that the complexity of the task would pose an
insurmountable challenge to any computer program.
In 1995, however, researchers at The Institute for Genomic Research (TIGR) suc-
cessfully used the shotgun sequencing technique to decipher the complete genome of
the bacterium Haemophilus influenzae [4]. The sequencing of this 1.83 million base
pair genome required the development of a specialized assembly program [5] as well
as painstaking laboratory efforts to complete those regions that could not correctly
be assembled by the software. The success of the Haemophilus project started a
genomics revolution with the number of genomes being sequenced every year in-
creasing at an exponential rate. At the moment the genomes of more than 1000
viruses, 100 bacteria, and several eukaryotes have been completed, while multiple
other projects are well on the way to completion. In parallel with the large amounts
of genomic data becoming available, the genomic revolution led to the birth of a new
field—bioinformatics—bringing together an eclectic mix of scientific fields such as
computer science and engineering, mathematics, physics, chemistry, and biology.
Critics of the shotgun sequencing approach continued to question its applicability
to large genomes despite obvious successes in sequencing bacterial genomes. They
argued the technique would be impractical in the case of large eukaryotic genomes
because repeats—stretches of DNA that occur in two or more copies within the
genome—would hopelessly confuse any assembler [6]. The standard procedure for
handling large genomes was a hierarchical approach involving breaking up the DNA
into large (50–150 kbp) pieces cloned in bacterial artificial chromosomes (BACs),
and then sequencing each BAC through the shotgun method. Most such criticism was
silenced in 2000 by the successful assembly at Celera of the genome of Drosophila
melanogaster [7] from whole-genome shotgun (WGS) data. The assembly was per-
formed with a new assembler [8] designed to handle the specific problems involved
in assembling large complex genomes. The researchers from Celera went on to as-
semble the human genome using the same whole-genome shotgun sequencing tech-
nique [9]. Their results were published simultaneously with those from the Interna-
tional Human Genome Sequencing Consortium, who used the traditional hierarchical
method [10]. Independent studies [11,12] later showed that the two assemblies pro-
duced similar results and many of the differences between them could be explained
by the draft-level quality of the data. The applicability of the WGS method to large
genomes was thus proven though some continue to argue the validity of Celera’s
results (some opinions on this topic are presented in [13–17]).
Celera’s success combined with the cost advantages of the WGS technique—
Celera sequenced and assembled the human genome in a little over a year while
the international consortium’s efforts had been going on for more than 10 years—
renewed interest in the WGS method and led to the development of several WGS
assembly programs: Arachne [18,19] at the Whitehead Institute, Phusion [20] at the
Sanger Institute, Atlas [21] at the Baylor Human Genome Sequencing Center, and
Jazz [22] at the DOE Joint Genome Institute. Most current sequencing projects have
opted for a WGS approach instead of the hierarchical approach. For example the se-
quencing of the mouse [23], rat [24], dog [25], puffer fish [22], and sea squirt [26]
all follow the WGS strategy.
The current issue of debate is the suitability of whole-genome shotgun sequencing
as the starting point in the efforts to obtain the complete sequence for a genome. All
sequencing strategies start by building a backbone, or rough draft, of the genome
whose gaps need to be filled in through further laboratory experiments. It is still not
clear which sequencing strategy will ultimately be the most efficient in obtaining the
complete sequence of an organism, especially as none of the large eukaryotic projects
have yet been finished, except for the 100 Mbp genome of the nematode Caenorhab-
ditis elegans, finished in October 2002. (The genomes of Drosophila melanogaster
and human are expected to be mostly finished before the end of 2003.)
Despite significant differences in the overall structuring of the sequencing process,
all sequencing strategies rely on shotgun sequencing as a basic component. The
reader is referred to [27,28] for an in-depth discussion of current approaches to se-
quencing. The following sections represent a description of the shotgun sequencing
technique, with an emphasis on the algorithmic challenges it imposes.
1 cloning vectors—DNA molecule from a specific host (virus, bacterium, or another higher organism)
into which a DNA fragment can be inserted such that it will be replicated by the host organism.
2 fosmid—cloning vector that accepts DNA inserts of about 40 kbp with a very tight size distribution.
3 bacterial artificial chromosome (BAC)—cloning vector used to clone large DNA fragments (100–
300 kbp).
DNA, and, therefore, the middle of the fragments remains unsequenced. This leads
to pairs of reads (also called mate-pairs), obtained from opposite ends of the same
fragment, which are naturally related. Such pairing information is essential to all
modern assembly algorithms.
In the most basic formulation, the task of an assembler is to connect together the
reads produced by the shotgun method in order to recover the original DNA se-
quence in a process not unlike putting together a jigsaw puzzle. The assembler uses
the sequence similarity between two reads to infer that they may originate from the
same region of the genome. This assumption is incorrect in the case of repeats—
stretches of DNA that occur in multiple identical or near-identical copies through-
out the genome—where the assembler may incorrectly combine reads coming from
distinct copies of a particular repeat. The reader familiar with jigsaw puzzles has un-
doubtedly encountered this situation when trying to put together pieces of sky. In the
best case repeats lead the assembler to generate more than one contiguous section of
DNA (called a contig), while in the worst case the assembler may incorrectly recon-
struct the original DNA. For example in Fig. 2 the repeat R tricks the assembler into
swapping the order of sections I and II of the genome. The read pairing information
can help the assembler detect and correct such errors; in the previous example, the
rearrangement invalidates the link between reads A and B.
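The similarity test at the heart of this process can be pictured as a search for a sufficiently long overlap between the end of one read and the start of another. The following is a minimal sketch using exact matching only; real overlap modules tolerate sequencing errors and use indexing techniques to avoid comparing every pair of reads (see Section 4.1). The reads shown are made-up toy data.

```python
# Minimal exact suffix-prefix overlap detection between two reads. Real
# overlappers allow mismatches/indels and use k-mer indexes to avoid the
# all-pairs comparison implied here.

def longest_overlap(a: str, b: str, min_len: int = 10) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

read1 = "ACGTACGTTTGACCATTGCA"
read2 = "GACCATTGCAAGGTTCCAAG"
print(longest_overlap(read1, read2, min_len=5))   # 10: the reads share 'GACCATTGCA'
```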
Even in the absence of repeats, however, the output of the assembler may consist
of more than one contig. This phenomenon can be explained with a simple analogy.
Imagine a sidewalk as it starts to rain. As droplets fall, the sidewalk becomes in-
creasingly wet, yet many spots remain dry for a while. Similarly, as the fragments
are being sequenced, the randomness of the shearing process leads to sections of the
original DNA not represented in the collection of reads. Therefore the best possible
FIG. 2. Genome rearrangement around a repeat. The top section of the picture represents the correct
layout. Each copy of repeat R is colored in a different color. The bottom image shows a possible mis-
assembly due to repeat R.
than $1.00 per read, and still falling). Targeted sequencing, however, involves direct
human intervention and is therefore much costlier. The genome size is another factor
in this decision: a mammalian genome is 1000 times larger than a bacterial genome,
and correspondingly more expensive. Several attempts [30,31] have been made to
determine the most cost-effective strategy for sequencing a genome to completion,
yet a universally accepted solution is not yet available.
In the previous paragraphs we discussed the situation when the particular genome
being sequenced is considered given and we are attempting to model an idealized
shotgun sequencing experiment. Li and Waterman [32] address the dual question:
given a set of shotgun reads, what conclusions can we draw about the structure of
the genome being sequenced? Specifically, they attempt to determine the length and
repeat content of the original genome that best explains the observed set of shotgun
reads. This problem has great importance in practical applications since the specific
FIG. 4. Fragment overlaps can only be detected when corresponding reads (represented by the thick
segments) overlap. The relationship between fragments A and B or fragments A and C cannot be detected.
The overlap between fragments B and C is implied by the overlap of their end-reads.
characteristics of the genome being sequenced are often unknown before sequencing.
Moreover, a solution to this problem can provide an invaluable method for quality
control during the shotgun process, by highlighting discrepancies between the ex-
pected structure of the genome and the information inferred from the shotgun data.
Li and Waterman’s approach is based on analyzing the frequencies of occurrence of
all l-tuples (strings of length l) present in the reads then matching the observed dis-
tribution to the expected values using a standard Expectation–Maximization (EM)
algorithm. They also provide an algorithm for characterizing the structure of the re-
peats present in the genome.
The Lander–Waterman statistics described above represent an approximation of
a true sequencing project. They assume all reads have the same length L, a case
not encountered in practice. Furthermore, as described above, sequencing reads rep-
resent a fraction of the shotgun fragments, as only the ends of the fragments are
being sequenced. The overlap between two fragments cannot be determined unless
the sequenced ends overlap. An example is shown in Fig. 4. These limitations of
the Lander–Waterman model were addressed by Arratia et al. [33]. They examine
a model where the overlap between two fragments can only be detected at particu-
lar “anchors” distributed along the genome. The concept of anchors is very general.
It may include short restriction sites4 commonly used in physical mapping experi-
ments (the original focus of these statistical analyses); however, it also encompasses
sequencing reads. Furthermore, the authors address the issue of variable fragment
lengths by assuming a particular distribution of fragment sizes. They obtain an inter-
esting result: if fragment lengths are sampled from a normal distribution, an increase
in the standard deviation for a particular mean leads to three events: (i) the expected
number of contigs decreases; (ii) the expected contig length increases, and (iii) the
expected fraction of the genome covered by contigs increases. Variability in read
length therefore has a beneficial effect on the outcome of a shotgun-sequencing ex-
periment.
4 restriction sites—short stretches of DNA that are recognized by specialized proteins (restriction enzymes) which cut the DNA at these sites.
Arratia et al.'s main contribution, however, is to provide the first statistical analysis
of scaffolds, a concept essential to all modern assembly algorithms. The relationship
between the two sequence reads derived from the same fragment can provide
useful linking information between the two contigs containing the reads. The esti-
mated length of the fragment provides an estimate for the size of the gap between
the contigs, while the orientation of the reads within the contigs determines the rel-
ative orientation of the two contigs as shown in Fig. 5. Note that the orientation
attribute of a read or contig is a direct consequence of the double-strand structure of
DNA. Each strand has an implicit orientation determined by the direction in which
DNA is replicated, always from one end (denoted 5’) to the other (denoted 3’). The
two complementary strands have opposite orientations; however, the global orien-
tations defined by the DNA molecule that is being sequenced are lost through the
shearing process. For any particular read it is therefore impossible to determine from
which strand of the original molecule it originated. Assembly programs reconstruct
one of the two strands (the other one is simply the reverse complement of the first)
though not necessarily the same strand for all the contigs. It is therefore important to
determine the relative “orientation” of the contigs, specifically whether two contigs
represent the same DNA strand, or they come from opposite strands. The contigs
thus placed in a consistent order and orientation form a scaffold—a term first intro-
duced by Roach et al. in [34]. An extension of the Lander–Waterman statistics in the
case of end-sequenced fragments was presented by Port et al. in [35]. They consid-
ered a collection of fixed-length fragments and analyzed a simplified definition of
scaffolds, specifically scaffolds that can be greedily generated by iteratively attach-
ing fragments to their rightmost end. They ignore the effect of transitively inferred
relationships, i.e., situations when the overlap between two fragments can only be
inferred from their overlaps with a third fragment (see Fig. 6). Yeh et al. (manuscript
submitted [36]) extended this analysis to the general case.
Our inability to sequence more than just the ends of each fragment leads to an
interesting situation. Scientists must sequence enough fragments to guarantee that
most of the genome will be represented by sequencing reads. As described by the
Lander–Waterman statistics, one needs to sequence roughly 8 times the size of the
genome in order to guarantee that almost all bases (99.9%) are contained in at least
one of the sequencing reads. The reads, however, represent only a fraction of the
fragments. The number of fragments needed to provide the required 8× coverage by
reads implies a "fragment" or "clone" coverage of the genome greater than that by
reads alone. Intuitively, each fragment contributes to the overall coverage both the
FIG. 6. The overlap between fragments A and C can be inferred from their overlaps with fragment B
even though none of their reads overlap.
section covered by reads, and the unsequenced portion of the fragment. To clarify
this phenomenon we provide a simple example. Assume a 1 Mbp genome covered
by 2 kbp fragments. Also assume that the sequencing reads are all 500 bp long. Se-
quencing at 8× read coverage requires 16,000 reads, and therefore 8000 fragments.
These fragments cover 16 Mbp, and thus cover the genome 16 times (16× fragment
coverage). This effect is even more pronounced in the case of longer fragments.
Mathematically, the relationship between fragment coverage (fc) and read coverage
(rc) is
fc = rc · flen / (2 · rlen)
where flen and rlen are the average fragment and read lengths, respectively. Applying
the Lander–Waterman statistics to the fragments shows that virtually every base of
the genome is contained in at least one fragment. This is a very important fact since
it implies that those bases not represented in the collection of reads will likely be
covered by at least one fragment. The fragments can later be used as a substrate for
specialized sequencing reactions targeting the unsequenced bases.
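To make the arithmetic concrete, the short sketch below (our own illustration; the function and variable names are not taken from any assembler) reproduces the 1 Mbp example above and adds the Lander–Waterman estimate e^(−c) for the fraction of the genome left uncovered at coverage c.

```python
import math

def fragment_coverage(rc, frag_len, read_len):
    """Clone (fragment) coverage implied by a given read coverage, assuming
    two reads (one per end) are sequenced from each fragment."""
    return rc * frag_len / (2 * read_len)

def uncovered_fraction(coverage):
    """Lander-Waterman estimate of the fraction of the genome not covered
    by any read (or fragment) at the given coverage."""
    return math.exp(-coverage)

# The example from the text: 1 Mbp genome, 2 kbp fragments, 500 bp reads.
genome, frag, read = 1_000_000, 2_000, 500
rc = 8                                    # 8x read coverage
n_reads = rc * genome // read             # 16,000 reads -> 8,000 fragments
fc = fragment_coverage(rc, frag, read)    # 16x fragment coverage

print(n_reads, fc)                        # 16000 16.0
print(uncovered_fraction(rc))             # ~3.4e-4: almost every base is in a read
print(uncovered_fraction(fc))             # ~1.1e-7: essentially every base is in a fragment
```

Note how quickly the uncovered fraction drops once the fragment coverage exceeds the read coverage, which is what makes the fragments a useful substrate for targeted finishing reactions.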
Port et al. also introduced a statistical framework for handling the case of non-
uniform sampling of the genome, a commonly encountered situation. Oftentimes,
certain characteristics of the DNA fragments, such as short stretches of repetitive
DNA, prevent their efficient replication in the cloning vector [37]. As a result, some
sections of the genome are represented poorly or not at all in the fragment collection.
Port et al. [35] were the first to mathematically address these situations by model-
ing the cloning bias as an inhomogeneous Poisson process. It is important to discuss
some of the implications of non-random libraries on assembly software. Let us as-
sume that the genome being sequenced contains a section (marked T in Fig. 7) that is
toxic to the cloning vector, and therefore not present in the collection of sequenced
reads. T’s toxicity implies that the library does not contain any fragment that encom-
passes a significant portion of the region; therefore, T remains largely unsequenced,
except for the ends being covered by reads from unaffected fragments. The more
important effect, however, is the absence of linking information across this region,
since the linking data is inferred from the fragments. The toxic region thus prevents
the formation of scaffolds that span it, which in turn means that any assembly will
break at this region.
FIG. 8. Contig adjacency as inferred from PCR reaction. The length of the product between primers
P1 and P2 provides an estimate on the size of the gap between contigs A and B.
The relative order and orientation of the resulting scaffolds is generally deter-
mined through laborious experimental techniques. The basic idea is to use the
polymerase chain reaction5 (PCR) [38] technique to amplify DNA directly from
the genome, without using a cloning vector. Scientists design PCR primers6 (short
stretches of DNA that provide a starting point for the PCR reactions) at the ends of
all scaffolds, then perform a set of PCR reactions in order to find pairs of primers
that are related. The technical issues in running PCR reactions are complex; however,
the basic idea is that the PCR reaction will amplify the region between two primers if
they are actually adjacent in the genome. The size of the PCR product7 (the ampli-
fied section) provides insight into the actual distance between the primers (see Fig. 8).
In the case of an assembly with N/2 scaffolds we need to determine the adjacency
information for N different scaffold ends. In essence we must perform N(N − 1)
PCR-comparisons to determine which ends are next to each other. For many bac-
terial genomes N is less than 100, however a typical eukaryotic genome may yield
hundreds or thousands of scaffolds. For simple genomes we need to perform hun-
dreds of PCR reactions, while for large genomes the technique becomes practically
infeasible and we therefore need additional information to order and orient the scaf-
folds. Nonetheless, this experimental technique is widely used in practice and efforts
were made to improve it. A related technique called multiplex PCR8 [39] was de-
veloped that allows scientists to pool together several primers (up to about 40) and
5 polymerase chain reaction (PCR)—laboratory technique through which a DNA segment is rapidly
replicated.
6 PCR primer—short (tens of base pairs) stretch of DNA that provides a starting point for the PCR
reaction.
7 PCR product—the DNA segment amplified by the PCR reaction.
8 multiplex PCR—technique that allows multiple PCR reactions to be performed at the same time.
effectively test all possible adjacencies of the primers in the pool in one step. The
result of this procedure is a certificate that one or more pairs of primers are adja-
cent; unfortunately, no information is provided as to which pairs were successfully
tested. Further tests are necessary to determine exactly which primers were adjacent.
Tettelin et al. [40] introduced the following problem: given N PCR primers, and a
limit K on the number of primers that can be included in a single multiplex pool,
determine the minimum number of reactions needed to check all possible primer
adjacencies. The authors provide a solution that requires P + √P reactions, where
P is a perfect square with the property that N/√P < K. The authors notice, however,
that while this solution optimizes the number of reactions, it does not minimize the
number of laboratory operations needed to perform these reactions. For each pool
scientists need to pipette all the primers into a reaction tube. Because this pipetting
operation is the source of most laboratory errors, Tettelin et al. proposed a solution
that attempts to minimize the number of pipetting steps. Their solution, an algorithm
called Pipette Optimized Multiplex-PCR (POMP), requires 2N − √N pipetting
operations; however, it cannot be proven optimal. After a first step of multiplex-PCR
reactions, the results of the POMP method need to be deconvoluted in order to ob-
tain the actual pairs of adjacent primers, leading to the need for additional reactions.
They do not present a theoretical analysis of the number of such additional reactions
required by the technique. Beigel et al. [41] extended the theoretical framework by
addressing the problem of correctly identifying all primer adjacencies in a reaction-
efficient manner. They prove a lower bound of Ω(n log n) reactions needed to de-
termine all the adjacencies and describe a multi-round solution to the problem that
is within a constant factor of the optimal bound. Alon et al. [42] further extend the
analysis and propose a probabilistic non-adaptive solution that matches the lower
bound to within a constant factor. Non-adaptive solutions are useful as they enable
the automation of the procedure.
This scaffold ordering step is part of the final stage of a sequencing project called
gap closure9 or finishing. This labor-intensive stage encompasses a variety of bioin-
formatics and laboratory techniques meant to “fill in” the remaining gaps in the
genome map, as well as to detect and correct mis-assembled repeats. The finishing
stage is one of the most expensive stages of a sequencing project and thus can ben-
efit from the support of specialized software. As an example, we refer the reader to
several papers describing the finishing support tools included in the commonly used
phred-phrap package [43–45]. Furthermore, Czabarka et al. [30] mathematically an-
alyze the closure process and present optimal solutions (in terms of the number of
finishing reactions needed) in the context of several theoretical models of the fin-
ishing stage. It is worth observing the synergy between the development of shotgun
9 gap closure or finishing—process through which the gaps between the contigs produced by the as-
sembler are closed through targeted sequencing reactions.
sequence assembly programs and the experience of human experts involved in the
finishing process. Most modern assemblers include features useful to the finishing
process by providing the types of information needed to guide additional laboratory
experiments. As an example, the Euler package contains a module that designs the
experiments needed to resolve over-collapsed repeats [46]. TIGR Assembler [5] is,
at the moment, unique in its ability to allow the users to “freeze” certain assemblies
and thus manually guide the assembly process. At the same time, assembly programs
use criteria developed by finishing experts to detect and correct mis-assemblies. The
Arachne assembler [19] includes such a module.
The following sections will describe in detail the various problems associated with
shotgun sequence assembly. Section 3 is devoted to the discussion of the most com-
mon assembly strategies at a conceptual level. Practical details of implementing spe-
cific tasks required by assembly programs are discussed in Section 4. Finally, Sec-
tion 5 will describe several new challenges to assembly programs.
3. Assembly Paradigms
In its most general form the sequence assembly problem involves reconstructing
the genome from the shotgun reads based on sequence similarity alone. This problem
can be further decomposed into two problems: the mapping or layout problem, in
which all reads need to be positioned correctly in the genome, and the consensus
problem, in which the contiguous DNA sequence of the genome is computed. It can
be easily seen that in this formulation the general problem is impossible to solve.
For example, consider a genome of length 20 composed entirely of the character A.
Assume that 8 random fragments of length 5 are sampled from the genome as shown
in Fig. 9. The layout problem clearly has no unique solution—indeed any placement
of the reads along the genome is possible. It can be argued that ultimately it is the
DNA sequence of the original molecule that needs to be reconstructed, the exact
placement of all the reads being irrelevant. This problem (the consensus problem) is,
however, also impossible to solve in this case. The only information we can glean
from the reads is that the genome is at least 5 base pairs long. Any string of As of
at least 5 bases can be explained by the reads; indeed, Fig. 9 shows some possible
reconstructions. It is thus clear that in the general case additional information is
required to solve the assembly problem. Please note that the theoretical example
we chose is not entirely unrealistic. By replacing the character A with any other
DNA string, the example becomes the problem of assembling tandem repeats: short
stretches of DNA that occur in many copies, in tandem, throughout the genome. Such
repeats are very common, for example in the telomeric and centromeric regions of
many eukaryotic chromosomes [47].
FIG. 9. Possible assemblies of the same set of 8 reads. The string in bold-type represents the supposed
consensus sequence. a) represents the correct layout, b) and c) represent incorrect assemblies.
of the genome, and the assumption that the fragments were selected through a
uniform random process, allow us to favor the reconstruction in Fig. 9(a) over
that in Fig. 9(c) and thus correctly resolve the repeat without the need to find the
exact location of each read. Statistical methods allow an assembler to validate a
particular layout by estimating the probability that the layout would occur in a
random shotgun experiment. This approach was introduced by Myers [48] and
is described in more detail in Section 4.2 in the context of repeat identification.
In practice, all the data provided to the assembler contain errors, making it difficult
or impossible for the assembler to correctly identify the original DNA sequence. The
assembly problem should thus be restated as having the goal to obtain a sequence that
is “close” to the original sequence. There is, unfortunately, no universally accepted
quality standard, though a widely used measure requires less than 1 difference in
every 10,000 bases between the assembled sequence and the original DNA. This
standard, commonly referred to as the Bermuda standard, was developed in the
context of the sequencing of the human genome by an international consortium (for
a summary of this standard see [49]).
The original greedy approach forms the basis of the first assembly algorithms and
thus warrants a more detailed explanation. Its basic structure is as follows (a minimal
sketch of the loop is given after the list):
1. add all the reads to a common store as singleton contigs (contigs with only one
read)
2. find the two contigs with the best overlap, combine the contigs and add the
resulting contig to the store
3. repeat the prior step until no more contigs can be joined.
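The following minimal Python sketch is our own illustration of the three steps above on a toy set of reads; it deliberately ignores sequencing errors, reverse complements, and repeats, all of which are discussed below.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b
    (exact match only; real assemblers must tolerate errors and overhangs)."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-olen:] == b[:olen]:
            return olen
    return 0

def greedy_assemble(reads, min_len=3):
    """Naive greedy assembly: repeatedly merge the two contigs with the
    best (longest) overlap until no overlap of at least min_len remains."""
    contigs = list(reads)                       # step 1: reads start as singleton contigs
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):         # step 2: find the best overlapping pair
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_len(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:                           # step 3: stop when nothing can be joined
            break
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

print(greedy_assemble(["ACGTT", "GTTAC", "TACGA"]))   # ['ACGTTACGA']
```

Because only the single best pair is examined at each step, the procedure is inherently local, a limitation addressed by the overlap-layout-consensus framework discussed in Section 3.2.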
This algorithm needs to be modified in order to be used in practice. Sequencing reads
usually contain sequencing errors, inherent to all current sequencing technologies.
Some amount of error is also introduced by the cloning technologies. In the presence
of errors, the shortest superstring problem can be formulated as the task of finding the
shortest sequence S that contains each read as an “approximate” sub-string, that is,
each read r matches a substring of S with less than e errors, where e is an estimate of
the error rate. Unsurprisingly, this problem was also shown to be NP-complete [55].
In the presence of errors, the definition of overlap from step 2 of the greedy al-
gorithm needs to be modified to account for these errors. Two reads are said to
overlap if they align such that the only differences between them can be explained
by sequencing or cloning errors. The differences, or the edit-distance between the
reads, are identified through a standard alignment algorithm (e.g., [56]) and only
those alignments with low error rates are used. Assembly algorithms can typically
tolerate alignments with between 2.5% and 6% insertion/deletion rates. Furthermore,
the alignment between the sequences must be “proper”, that is, the alignment must
start and end at sequence ends. This requirement addresses spurious overlaps caused
by repeats (see Fig. 10).
The redefinition of the notion of overlap requires us to reexamine the quality of
an overlap. The “best” overlap is no longer the longest one. Practical implementa-
tions use one or more of the following measures when determining the quality of an
overlap:
• Length of the overlap region;
• Number of differences, or conversely the percentage of bases shared by the two
sequences as a fraction of the length of the overlap;
• Score of the alignment comprising four elements: reward for a good match,
substitution penalty, gap opening penalty, and gap extension penalty;
• Quality adjusted score of the alignment—the four components of the score are
adjusted to take into consideration the quality scores assigned by the base caller;
• Number of mate-pairs confirming (or conflicting with) the overlap.
FIG. 10. Proper (a) and repeat induced (b) overlap between two reads (caused by repeat R). (c) repre-
sents a proper overlap in the case when errors are present in the reads. The unaligned regions are called
“overhangs”.
3.2 Overlap-Layout-Consensus
The original greedy approach to sequence assembly is inherently local in nature
as only those contigs being merged are examined by the algorithm. Longer-range
interactions between reads can be considered; however, this information is not easily
incorporated in the standard algorithm, leading to complex implementations. Peltola
et al. [57] and Kececioglu and Myers [48,58] introduced a new theoretical frame-
work that addresses the global nature of the assembly problem: the overlap-layout-
consensus (OLC) paradigm. They refine the problem by decomposing it into three
distinct sub-problems:
• overlap—find all the overlaps between the reads that satisfy certain quality
criteria.
• layout—given the set of overlap relationships between the reads, determine a
consistent layout of the reads, i.e., find a consistent tiling of all the reads that
preserves most of the overlap constraints.
• consensus—given a tiling of reads determined in the layout stage, determine
the most likely DNA sequence (the consensus sequence) that can be explained
by the tiling.
It is important to note that the greedy approach is, in some sense, an OLC algorithm,
despite the fact that in many implementations the three distinct stages are not clearly
delimited.
The main component of any OLC algorithm is the layout stage, originally for-
mulated in graph theoretic terms. The overlap stage generates a graph whose nodes
represent the reads. Two nodes are connected by an edge if the corresponding reads
are involved in a proper overlap (as defined by preset quality criteria such as those
discussed in Section 3.1). The layout problem can thus be defined as the task of find-
ing an interval sub-graph that maximizes a particular target function. In other words,
the layout algorithm must find a sub-graph that contains only those edges represent-
ing overlaps consistent with a placement of the reads viewed as intervals along a line.
As an optimization target, Peltola et al. [57] look for a layout that “uses best possible
overlaps” while Kececioglu and Myers [58] attempt to maximize the weight of the
resulting sub-graph, given a set of weights corresponding to the quality of the over-
laps. They also show that, under this definition, the layout problem is NP-complete
and propose a greedy approximation algorithm as well as a method for enumerating
a collection of alternative solutions. Myers [48] further introduces a variant of this
problem that generates the layout that best matches the statistical characteristics of
the fragment shearing process under Kolmogorov–Smirnov statistics.
Layout algorithms operate on huge graphs (from tens of thousands of nodes in the
case of typical bacterial genomes to tens of millions in mammalian-sized genomes)
making the global optimization of the layout practically impossible. Practical imple-
mentations attempt to solve the layout problem in a greedy fashion, by starting with
those regions of the graph that can be resolved unambiguously. Thus Myers [48] pro-
poses a series of transformations that convert the initial overlap graph into a chunk
graph, where a chunk is a maximal interval sub-graph. In other words, chunks rep-
resent sections of the genome that can be unambiguously resolved, i.e., the sections
between repeats. Repeats cause branches in the overlap graph (see Fig. 11) and sig-
nal the end of a chunk. The complexity of the graph is thus greatly reduced allowing
the use of more sophisticated algorithms in order to obtain a consistent layout of the
chunks. Usually at this stage the chunk graph is augmented with additional informa-
tion, such as mate-pair data.
Variants of this idea were used in practical implementations; for example, Celera
Assembler [8] (where the chunks are called unitigs: uniquely assembleable contigs)
and Arachne [18] start the assembly process by identifying the unambiguous sections
FIG. 11. Effect of repeats on overlap graph. a) represents the overlap of three reads in the absence of
repeats. The overlap graph contains an edge for each pair of reads. b) represents the overlap between three
reads at the boundary of a repeat. The overlap graph lacks the edge between A and C.
of the genome then use a scaffolding step to generate the final layout. Arachne uses
the mate-pair information from the beginning of the assembly process by identifying
paired-pairs, that is, pairs of shotgun fragments of similar lengths whose end se-
quences overlap (see Fig. 12). Paired-pairs represent a high confidence structure that
can be used to seed the contig building process. In contrast, Celera Assembler uses
mate-pair information in the later stages of assembly.
FIG. 13. a) represents the uniform coverage of a genome with k-mers. b) represents the (uneven)
coverage by a randomly sheared set of fragments.
10 hybridization—process through which a short single stranded nucleotide string attaches to the com-
plementary string in the DNA strand being analyzed.
The SBH problem can be represented in graph theoretic terms as follows: given
a set of k-tuples, we construct a graph whose nodes represent all the (k − 1)-tuples
in the set. Two nodes are connected by an edge if the corresponding (k − 1)-tuples
represent the prefix and the suffix of one of the original k-tuples. Thus the k-tuples
are implicitly represented by edges in the SBH graph. A solution to the SBH problem
is represented by a path through the graph that visits every single edge (k-tuple). This
problem is related to a classic problem in graph theory—the Eulerian path problem—
that requires finding a path through the graph that uses every single edge exactly
once. Note that in the case of SBH, a particular k-tuple may be used multiple times
if it occurs in a repeat.
The Eulerian path problem is generally easy to solve—if such a path exists it
can be found in linear time with respect to the number of edges in the graph. The in-
stances of the problem induced by SBH experiments, however, limit the applicability
of this approach to sequencing short pieces of DNA. On the one hand, the k-tuple
array must contain all possible k-tuples, thus physically limiting the size of k, and
implicitly the size of strings that can be sequenced by SBH. As an example, a 2-tuple
array containing the 16 possible di-mers cannot be used to sequence any DNA string
longer than 17 bases. On the other hand, hybridization errors and repeats contained in
the DNA string complicate the graph. While finding an Eulerian path is easy, the task
of finding the correct path—the path corresponding to the original DNA molecule—
is a much harder problem. For example, the graph in Fig. 14 can explain two different
DNA strings corresponding to reorganizations around repeat R. The graph alone does
not contain the information necessary to make the correct choice. Note however that
the repeat is immediately recognizable in the graph. Figure 15 contains the example
of a tandem repeat and the corresponding structure in the overlap and SBH graphs.
The repeat can be easily identified in the SBH graph, however it is not immediately
obvious in the overlap graph.
Despite the limited applicability of the SBH technique to the actual sequencing of
DNA, its theoretical structure leads to an alternative approach to shotgun sequence
FIG. 14. SBH graph for a 3-copy repeat R. The graph supports two different reconstructions of the
genome: ARBRCRD and ARCRBRD.
FIG. 15. Tandem repeat (shaded regions in the top picture) and its representation in the overlap graph
(bottom left) and the SBH graph (bottom right). The numbered lines in the top region represent reads
while the short segments correspond to k-mers. The SBH graph does not contain a representation of all
the k-mers due to lack of space. The k-mer represented in gray spans the boundary between the two copies
of the repeat and is therefore unique in the genome. The loop in the graph corresponds to those k-mers
contained in the repeat region.
assembly. Idury and Waterman [59] proposed using the shotgun reads to simulate an
SBH experiment. They break up each read into overlapping k-mers. The combined
k-mer spectra of all the reads correspond to the k-mer spectrum of the original DNA,
and thus solving the SBH problem is equivalent to solving the initial shotgun se-
quence assembly problem. It is important to note that the in-silico SBH experiment
does not impose limitations on the size of k. The algorithms need only process those
k-mers actually present in the read set. For a genome of size G, we expect G − k + 1
such k-mers, a number that is generally much smaller than 4k (the set of all pos-
sible k-mers). The technique is thus, at least theoretically, applicable to arbitrarily
sized genomes. The authors notice, however, that sequencing errors lead to spurious
k-mers that greatly complicate the graph.
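A small sketch of the in-silico SBH construction (our own illustration, assuming error-free reads): each read is broken into its k-mer spectrum, and each k-mer contributes an edge from its (k−1)-mer prefix to its (k−1)-mer suffix; an Eulerian path through the resulting graph spells out a reconstruction of the sequence.

```python
from collections import defaultdict

def read_spectrum(read, k):
    """The k-mer spectrum of a read (the in-silico SBH 'experiment')."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def sbh_graph(kmers):
    """SBH (de Bruijn-style) graph: nodes are (k-1)-mers and every k-mer
    contributes a directed edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

# Combine the spectra of two overlapping, error-free reads (k = 3); the
# duplicated edges come from the overlapping region of the two reads.
kmers = read_spectrum("ATGGCG", 3) + read_spectrum("GGCGTA", 3)
print(dict(sbh_graph(kmers)))
# {'AT': ['TG'], 'TG': ['GG'], 'GG': ['GC', 'GC'], 'GC': ['CG', 'CG'],
#  'CG': ['GT'], 'GT': ['TA']}
```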
Idury and Waterman’s mainly theoretical work was extended by Pevzner et al.
[60–62] leading to a practical implementation, a software package called Euler. They
addressed several issues of a practical nature. First of all, their algorithms depend on an
error correction module, since sequencing errors hopelessly tangle the Euler graph.
A description of this module is provided in Section 4.2. Secondly, they use the initial
reads as a guide in generating the Eulerian path (this idea was introduced by Idury
and Waterman), leading to a new problem—the Eulerian superpath problem: find
an Eulerian path that confirms most reads in the input. The constraints provided
by the reads help resolve most short repeats. Thirdly, the authors use the mate-pair
This hierarchical method thus introduces the need for two specialized assembly
programs: one that performs the individual assemblies within each fragment, and one
that pastes together the finished fragments using the overlaps and the initial fragment
map as a guide. For the first task, existing assembly programs such as phrap and TIGR
Assembler are commonly used, as the small size of the BACs does not impose signif-
icant assembly challenges (though complex repeats remain a problem even in such a
localized context). The latter task is quite easy and, typically, the final assembly step
is performed either manually or through a collection of simple computer programs.
This was the approach used in sequencing the model plant organism Arabidopsis
thaliana [65]. The initial assembly of the human genome by the Human Genome
Sequencing Consortium, however, required a much more sophisticated approach,
given the quality of the available data. A special program, called GigAssembler [66],
was used to combine a collection of finished and partially finished BACs, as well as
many individual contigs. The problems were compounded by errors in the physical
mapping data and mis-assemblies of the contigs or BAC sequences. GigAssembler
uses techniques similar to those developed for shotgun-sequence assembly, therefore
some of the technical details of the implementation will be discussed later in that
specific context.
The hierarchical sequencing approach led to active research in the development of
specialized techniques for obtaining the initial BAC map. Such research addresses
both the laboratory technologies involved in physical mapping [67] as well as the
software issues involved in generating and analyzing such maps.
Researchers at the Baylor College of Medicine developed a BAC mapping tech-
nique that combines the cost advantages of shotgun sequencing with the simpler
algorithms required by the hierarchical method. They follow a hybrid approach
wherein the mapping of the BACs along the genome is replaced by a low cover-
age “light” shotgun of a collection of BAC clones. At the same time, the genome is
sequenced using the standard shotgun sequencing technique. The last step of their
technique involves mapping the shotgun reads to individual BACs, using the reads
generated through the light shotgun of the BACs as anchors. The BACs thus serve as
a guide to clustering the shotgun reads into more manageable blocks. This technique
represents the basis of the assembly program Atlas. Please note that, in the absence of
an actual BAC map, the final step of joining the individual BAC assemblies together
becomes considerably more difficult. Similar BAC “recruiting” techniques form the
basis of the Phusion [20] and RePS [68] assemblers and were also proposed as an al-
ternative to whole-genome-shotgun in the assembly of the human genome at Celera
[69].
Cai et al. [28] propose a refinement of this hybrid technique, called Clone Array
Pooled Shotgun Strategy (CAPSS). They place each BAC's DNA within the cells
of a two-dimensional matrix, then pool the DNA within each row and column of the
matrix as shown in Fig. 16. The light shotgun step is then applied to each pool, thus
reducing the number of libraries created. In the case of a collection of n BACs, the
initial approach requires the generation of n libraries, while the pooled DNA method
requires only 2√n libraries. Each BAC will thus be represented in the reads from
two different libraries, one from the row, and the other from the column containing
the BAC’s well. In the last step of CAPSS, the shotgun reads are assembled together
and the resulting contigs used to identify the correct mapping of reads to BACs.
The contigs that contain reads from both column i and row j of the CAPSS array
correspond to reads generated from the BAC clone in well (i, j). Furthermore, this
strategy can be extended to also produce a map of the BAC clones, thus circumvent-
ing the need for an additional mapping step. This technique, called Pooled Genomic
Indexing (PGI) [70] requires two separate arrays for each set of BACs. The place-
ment of BACs in the wells is shuffled between the two arrays so that no two clones
occur within the same row or column in both arrays. As a result, the deconvolution
of the contigs also yields the relative placement of pairs of BACs, information that is
sufficient to generate a map of the genome.
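The deconvolution step of CAPSS can be sketched as follows (our own simplification; the pool labels and data structures are hypothetical, not taken from Atlas or the CAPSS papers): a contig whose reads come from exactly one row pool and one column pool is assigned to the BAC in the corresponding well.

```python
def assign_contig_to_well(contig_reads, read_pool):
    """Map a contig to a BAC well (row, col) from the pools of its reads.

    contig_reads: iterable of read identifiers in the contig.
    read_pool:    dict mapping a read id to the pool it came from,
                  e.g. ("row", 3) or ("col", 7).
    Returns (row, col) if the contig is consistent with a single well,
    otherwise None (ambiguous or conflicting pool evidence).
    """
    rows = {read_pool[r][1] for r in contig_reads if read_pool[r][0] == "row"}
    cols = {read_pool[r][1] for r in contig_reads if read_pool[r][0] == "col"}
    if len(rows) == 1 and len(cols) == 1:
        return rows.pop(), cols.pop()
    return None

# Toy example: reads a and b come from row pool 2, read c from column pool 5,
# so the contig containing them maps to the BAC in well (2, 5).
pools = {"a": ("row", 2), "b": ("row", 2), "c": ("col", 5)}
print(assign_contig_to_well(["a", "b", "c"], pools))   # (2, 5)
```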
It is important to note that hierarchical approaches are also very important in whole
genome shotgun sequencing. Indeed, such approaches are essential during the fin-
ishing stages of a sequencing project. As an example, the following hierarchical ap-
proach is commonly used to correctly assemble repeats. Clearly repeats are only a
problem if reads corresponding to two or more nearly-identical copies of a repeat are
being assembled at the same time. When faced with a potentially mis-assembled re-
peat, researchers attempt to identify fragments whose ends are anchored in the unique
areas flanking a particular repeat copy. Note the importance of having libraries of
multiple sizes as they allow the resolution of different classes of repeats. The frag-
ments are then further sequenced, either through directed sequencing or through a
separate shotgun experiment (depending on size). The resulting reads can be safely
assembled together since they represent a single copy of the repeat. The resulting
contig, together with the flanking unique sequence can then be used as a building
block in assembling the rest of the genome, without the risk of mis-assembly. It is
important for the assembly program to allow such a hierarchical approach by provid-
ing a means for “jump-starting” the assembly with the already generated contigs.
4. Assembly Modules
The specific algorithmic approaches to the task have evolved throughout the years,
as increasingly more complex sequencing projects were tackled through the shotgun
method. The earliest algorithms involved either iteratively aligning each read to an
already generated consensus [2] or comparing all the reads against each other [57].
Most recently the detection of read overlaps involves sophisticated techniques meant
to reduce the number of pairs of reads being analyzed. For example, in the case of the
human genome, a full pairwise comparison of all 50 million reads from a 5× shot-
gun sequencing experiment would be prohibitive, especially as, at least theoretically,
each read overlaps only a small number of other reads (approximately 5 other reads).
Furthermore, recent algorithms based on the sequencing-by-hybridization paradigm
[59,62], avoid the explicit computation of read overlaps since these are implicitly
represented in the graph constructed by such algorithms.
The overlaps implied by sequence similarity between reads fall in two classes:
“real” overlaps—the reads were obtained from the same region of the genome;
“repeat-induced” overlaps—the reads belong to two distinct repetitive regions of the
genome (see Fig. 10). Ideally, an assembler should only use the “real” overlaps since
“repeat-induced” overlaps lead to ambiguities in the placement of the reads. Such
ambiguities are often hard or impossible to resolve and most assemblers use several
heuristics to reduce the number of repeat-induced overlaps generated.
Two reads are said to overlap only if the overlap is proper, that is, either one read
is entirely contained in the other, or the two reads properly dove-tail as shown in
Fig. 10(a). This heuristic avoids considering short repeats by eliminating the overlaps
represented in Fig. 10(b). The unaligned regions in this figure are called overhangs.
A second heuristic requires the reads to be highly similar in the region of overlap.
In the absence of sequencing errors, two reads that were generated from the same
region of the genome will have identical sequences, thus any “imperfect” alignment
would indicate an overlap induced by a repeat whose copies have diverged during
evolution. In practice, however, sequencing errors must be taken into account, and
therefore assembly algorithms must tolerate imperfect alignments. One or more of
the following “imperfections” are usually allowed when considering overlaps: base
substitutions, base insertions or deletions, and overhangs. Please note the delicate
balance between the sequencing error tolerated by an assembler, and the ability to
detect repeat induced overlaps. An algorithm that requires perfect overlaps between
reads would only be confused by exact repeats (identical stretches of DNA that oc-
cur in multiple places throughout the genome). Such an algorithm will, however,
only identify a small percentage of all true overlaps due to sequencing errors. An
algorithm tolerating a 3% sequencing error rate (this number corresponds to the esti-
mated error rates at the large sequencing centers) will identify most true overlaps. At
the same time the algorithm will identify more repeat-induced overlaps, specifically
those due to repeats that have less than 3% differences between repeat instances.
ment. Such algorithms are commonly known as banded alignment algorithms. For
a more in-depth description of the various string alignment algorithms the reader is
referred to [87].
Besides identifying the specific overlap of two given reads, an overlap algorithm
must also determine which pairs of reads overlap. This step of the algorithm can also
be more efficiently implemented when the error rates are small. In the case of un-
bounded error rates an overlapper must examine all possible pairs of reads, leading to
a quadratic number of pairwise comparisons—an inherently inefficient process. The
following observation leads to a more efficient approach in the case when the num-
ber of errors tolerated by the algorithm is small. A low number of errors implies that
two overlapping reads must share several identical stretches of DNA. As described
above, algorithms that identify exact matches are very efficient leading to a two-step
process for identifying read overlaps. First, a set of short identical matches between
the reads is identified, then only those pairs of reads that share the same set of ex-
act matches are considered in more detail. Chen and Skiena [78] estimate that this
simple heuristic reduces by a factor of 1000 the number of pairs of reads that need
to be considered. Furthermore, the exact matches between two reads can be used to
“seed” their alignment, greatly reducing the amount of time required to perform a
detailed alignment. Such techniques are commonly used to speed up database search
algorithms such as BLAST [88] or FASTA [89]. The AMASS assembler [90] further
extends this approach by entirely skipping the detailed alignment step. Thus read
overlaps are identified by examining the pattern of shared exact matches, a detailed
alignment being postponed until the last phase of the assembly—the generation of
the final consensus.
Besides the suffix-tree and suffix-array techniques described above, the detection
of exact matches between pairs of reads is usually performed by building a map (usu-
ally under the form of a hash table) of all k-mers present in the reads, keeping track
of the set of reads that contain each k-mer. A simple pass through the table is suffi-
cient to identify all pairs of reads that have a particular k-mer in common. Variants
of this simple approach are used by virtually all assemblers used in practice [5,8,18,
20–22,64,66,91,92]. Tammi et al. [93] suggest an extension of this basic approach by
structuring the k-mer database in such a way as to allow querying for inexact word
matches. Specifically, they describe a method for finding all k-mers that have less
than d differences from a given query k-mer, where d is a parameter corresponding
to the expected sequencing error rate. Using this technique they hope to improve the
sensitivity of the overlap stage of the assembly.
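A minimal sketch of the k-mer indexing idea (ours, not the code of any particular assembler): hash every k-mer to the reads that contain it, and report only those read pairs sharing at least one k-mer as candidates for the more expensive alignment step.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k):
    """Return pairs of read indices that share at least one exact k-mer.
    Only these candidate pairs need to be passed to a detailed alignment;
    all other pairs are never examined."""
    index = defaultdict(set)                 # k-mer -> set of read indices
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["ACGTACGGA", "TACGGATTG", "GGGCCCAAA"]
print(candidate_overlaps(reads, 5))          # {(0, 1)}: only reads 0 and 1 share a 5-mer
```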
The choice of the length parameter k affects the sensitivity of the overlap detec-
tion algorithm. Short k-mers occur frequently in the DNA sequence, leading to the
identification of many potential read pairs that need to be evaluated by the algorithm.
The use of long k-mers may cause the algorithm to miss many true overlaps due to
the effect of sequencing errors. The k-mer size generally ranges between 10 bp (as
in GigAssembler [66]) and 32 bp (as in TIGR Assembler [5]), the specific choice
depending on the nature of the data processed by the assembler. The GigAssembler
was designed to tolerate the large error rates inherent to the heterogeneous nature of
the Human Genome Project, while TIGR Assembler could afford a more stringent
value due to the high quality of the data generated by shotgun sequencing of bacteria.
In general, assembly programs identify all distinct k-mers present in the reads.
Roberts et al. [94] note that it is sufficient to store only a subset of all k-mers, thereby
significantly reducing the time and space requirements of the overlap routine.
For each set of m consecutive k-mers—consecutive means each k-mer is shifted
by one base from the previous one—they only store the minimizer, i.e., the small-
est k-mer in terms of a specific lexicographic order. They show that any reads that
share an exact match of more than m + k − 1 bases must have at least one such
minimizer in common. For appropriately chosen values of m and k, the minimizer
technique greatly reduces the complexity of the overlap stage without missing true
overlaps. The authors also describe a procedure based on file sorting that allows
them to trade off expensive RAM memory for much cheaper disk space, without a
significant degradation in performance.
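The minimizer selection itself can be sketched in a few lines (our own illustration; the values of k and m are chosen only for the toy example): for each window of m consecutive k-mers, only the lexicographically smallest one is retained, and reads are indexed by these minimizers instead of by all of their k-mers.

```python
def minimizers(read, k, m):
    """Return the minimizers of a read: for every window of m consecutive
    k-mers, keep only the lexicographically smallest one."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    return {min(kmers[i:i + m]) for i in range(len(kmers) - m + 1)}

# Two reads sharing an exact match longer than m + k - 1 = 10 bases are
# guaranteed to share at least one minimizer, so indexing reads by their
# minimizers (instead of all their k-mers) still detects this candidate pair.
a = "TTTACGTACGGACT"
b = "ACGTACGGACTAAA"
print(minimizers(a, k=4, m=7) & minimizers(b, k=4, m=7))   # {'ACGG'}
```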
The overlap stage of an assembler trivially lends itself to parallelization. A set of n
reads can be partitioned into K subsets. This leads to K² distinct overlap tasks that
can be performed in parallel, corresponding to all possible pairings of the K sets. The
overlap task pairing sets i and j leads to the identification of all reads in set i that
overlap reads in set j . This approach was used at Celera in order to take advantage of
their large processor farm [8]. Note, however, that their approach converts a task that
can be solved in linear time into a quadratic process. Their technique can, therefore,
only provide an advantage over the single processor solution for small values of K.
Parallelization of the overlap stage was also proposed by Huang et al. [95] as a main
component of their assembler (PCAP) specifically designed to handle mammalian-
sized genomes.
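The partitioning scheme can be sketched as follows (ours; the round-robin subset assignment is arbitrary): the reads are split into K subsets, and each of the K² ordered subset pairs becomes an independent overlap job that can be dispatched to a separate processor.

```python
from itertools import product

def overlap_jobs(reads, k_sets):
    """Partition the reads into k_sets subsets and enumerate the k_sets**2
    independent overlap tasks (ordered pairs of subsets)."""
    subsets = [reads[i::k_sets] for i in range(k_sets)]
    return [(subsets[i], subsets[j]) for i, j in product(range(k_sets), repeat=2)]

jobs = overlap_jobs([f"read{i}" for i in range(12)], k_sets=3)
print(len(jobs))   # 9 independent tasks for 3 subsets
```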
The overlaps identified in this stage of assembly provide the input to the lay-
out stage. They are, however, also used to identify specific features of the genome:
chimeric reads, missed overlaps, repeats, and sequencing errors.
Chimeric reads (see Fig. 17) are an artifact of the sequencing process wherein
two distinct sections of the genome are represented in the same read. Such errors
are ubiquitous in gel-based sequencers though they have become much less common
since capillary-based sequencers have been introduced. They can also be an arti-
fact of the cloning process, due to the recombination of the DNA fragments. Since
chimeric reads do not represent any section of the genome, their overlaps with other
reads can be used to detect the "separation point", that is, the place in the read where
the two distinct sections of the genome come together (see Fig. 18). Although rare,
FIG. 17. Chimeric read. The read contains DNA from two unrelated sections of the genome.
FIG. 18. Identification of chimeric read from overlap with other reads. The breakpoint is not spanned
by any other read in the genome.
chimeric reads can confuse the assembler therefore some assembly packages include
a module that detects and eliminates potential chimeras [5,18,64,92].
Chen and Skiena [78] also propose a method for identifying those overlaps that
might have been missed by a stringent overlap algorithm. They identify the transitive
relationship between three reads where only two of the overlaps had been identi-
fied (see Fig. 11(b)). Such a situation, commonly induced by repeats, can also be
caused by sequencing errors, therefore it is useful to attempt to identify such over-
laps missed by a stringent overlap algorithm. Finally, the overlaps between reads can
be used to identify and correct sequencing errors, and to detect repetitive regions that
might confuse the layout stage. These problems will be discussed in detail in the next
section.
FIG. 19. Correlated "errors" between reads can be evidence of mis-assembled repeats. Columns 1 and
3 represent such evidence. The disagreement in column 2 is most likely an error as it occurs in only one
of the reads.
path approach of Pevzner et al. [60] is affected by sequencing errors to such an extent
that their algorithms depend on an error correction module.
It is therefore not surprising that many of the recent developments in the field
of sequence assembly address specifically the task of automatically correcting se-
quencing errors during a pre-processing stage of assembly. The basic idea behind all
error correction approaches is statistical in nature. It assumes that sequencing errors
occur in a random fashion within each read; furthermore, the distributions of errors
in distinct reads are independent of each other. The probability that two overlapping
reads would contain the same sequencing error at the same exact location is therefore
practically negligible. Within a tiling of reads corresponding to a specific section of
the genome an error at any particular position would occur in only one of the reads
(see Fig. 19). Correlated errors between reads represent strong evidence of either the
presence of multiple distinct copies of a repeat, or the existence of multiple divergent
haplotypes11 in the DNA being sequenced. Please note that this discussion only ap-
plies to sequencing errors which can, in practice, be considered as the outcome of a
random process. Other types of errors occur in shotgun sequencing projects, which
do not have the same random behavior as sequencing errors. For example, the pres-
ence of long stretches of the same repeated nucleotide causes the sequencing reaction
to "slip", leading to errors in all the reads containing the particular sub-sequence.
The most commonly used sequencing technique involves the identification of flu-
orescently tagged DNA as it passes in front of a detection mechanism. The physical
output of a sequencer consists of four signals corresponding to the four different
nucleotides (see Fig. 20). Specialized programs (called base-callers) use signal-
processing techniques to identify the individual bases composing the DNA strand
11 haplotype—Eukaryotic genomes generally contain two copies of each chromosome. Each copy is
obtained from one of the parents, thus the two copies may differ from each other. Each of the alternative
forms of the genotype (complement of genes) corresponding to the two chromosomes is called haplotype.
FIG. 20. The four signals, corresponding to each of the four bases, produced by an automated se-
quencer. This diagram is called a chromatogram.
being sequenced [96,97]. These programs also produce an estimate of the quality
of each base, in terms of the log-probability that the particular base is incorrect
(qv = −10 log₁₀(perror)). Such error estimates have been shown to be relatively ac-
curate [97].
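For reference, the quality value convention amounts to the following two conversions (a sketch assuming the standard base-10 Phred-style scaling):

```python
import math

def error_prob(qv):
    """Probability that a base call with quality value qv is wrong."""
    return 10 ** (-qv / 10)

def quality_value(p_error):
    """Quality value assigned to a base with error probability p_error."""
    return -10 * math.log10(p_error)

print(error_prob(20), quality_value(0.001))   # 0.01  30.0
```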
The first attempts at reducing the effects of sequencing errors used these error-rate
estimates when computing read overlaps, thus allowing mis-matches if one or both of
the bases had low qualities, and penalizing mis-matches between high-quality bases.
Most of the early assembly programs [5,64,79,91,92] used this approach and the sim-
ple idea continues to be used in some of the recently developed assemblers [18,22].
Also note that base quality estimates are used by most assembly programs to roughly
identify the high quality portion of each read (called clear range) since sequencing
errors are significantly biased towards the ends of each read. Some assemblers (e.g.,
phrap [91]) perform this step internally, while others require specialized “trimming”
software [98] to remove the poor quality ends of the reads.
Huang [92] was, to our knowledge, the first to introduce the idea of using the align-
ment of multiple reads to identify the location of sequencing errors. For each read
r he introduces the notion of an error rate vector which, for each base b, stores
the largest error rate in a section of r bases starting at b as defined by alignments
of read r with all the other reads it overlaps. The error rate vectors are then used
to evaluate the overlaps between reads in order to identify chimeric fragments and
repeat-induced overlaps.
Huang [92] and Kececioglu and Yu [99] were the first to address the error correction
problem in the context of a multiple alignment of a collection of mutually overlap-
ping reads. They actually solve a complementary problem, that of separating multiple
non-identical copies of the same repeat using correlated mismatches between the reads
of the multiple alignment. The authors attempt to identify columns in the multiple
alignment where reads disagree. If the disagreements between reads are correlated,
i.e., occur in more than one single read, they refer to the columns as separating
columns (columns 1 and 3 in Fig. 19). The assumption is that if a particular “error”
occurs in a single read it is due to a sequencing error, however, multiple correlated
“errors” indicate the collapse of two or more repeat copies. The assembly algorithm
can thus avoid mis-assembling the repeat by removing the overlaps between reads
however they gain a considerable reduction in the complexity of the assembly prob-
lem. The task of resolving such complex repeats is left to specialized modules that
use additional sources of information.
It is important to discuss at this point the task of identifying those reads that
belong to repetitive regions. The techniques of Kececioglu and Yu and Tammi et
al. are greatly helped by such information. The most common methods for repeat
identification use the very nature of the shotgun sequencing project. The reader is
reminded that a shotgun sequencing project starts through the random shearing of
the DNA into a collection of fragments whose ends are then sequenced such that
they over-sample the initial DNA to a specified extent (ranging from 5 to 10 times
for typical sequencing projects). For a bacterial genome each base is thus expected
to appear in 8 reads (corresponding to 8× coverage). The DNA of a repeat is over-
sampled in proportion to the number of copies. One can, therefore, expect that each
base of a two copy repeat would occur in approximately 16 reads. This simple idea
is very effective not only in identifying repeats but also in estimating their specific
copy number [104]. From a theoretical standpoint, the random nature of the initial
shearing process allows the development of statistical tests to identify the repetitive
sequences. Kececioglu and Yu [99] identify collapsed repeats by estimating the prob-
ability of observing a certain depth of coverage at a particular point in the genome.
Myers et al. [8] analyze the arrival rate of fragments, i.e., the distribution of the frag-
ment start points. Thus they compute the log-ratio of the probability that the observed
distribution is representative of a unique, versus the probability that it is representa-
tive of a two-copy collapsed repeat. Please note that such statistical approaches rely
on a random distribution of the fragments being generated by the shotgun process.
Achieving such randomness is difficult in practice leading to limitations in the abil-
ity to correctly identify repeats. As an example, during the finishing stages complex
regions of a genome are sequenced to a higher coverage than the rest, causing the
assembler to incorrectly label them as repetitive.
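The arrival-rate test can be sketched as a Poisson log-odds computation (our own rendering of the idea rather than the exact statistic used in [8]): compare the likelihood of the observed number of read start points in a stretch of the assembly under a single-copy model against a model in which two copies have been collapsed.

```python
import math

def arrival_log_odds(k, delta, n_frags, genome_len):
    """Log-odds that k fragment starts in a stretch of length delta come from
    a unique (single-copy) region rather than a collapsed two-copy repeat,
    modeling fragment start points as a Poisson process."""
    rate = n_frags / genome_len                  # expected starts per base
    lam1, lam2 = rate * delta, 2 * rate * delta  # unique vs. collapsed repeat
    log_p1 = -lam1 + k * math.log(lam1)          # Poisson log-likelihoods
    log_p2 = -lam2 + k * math.log(lam2)          # (the common k! term cancels)
    return log_p1 - log_p2                       # > 0 favors "unique"

# 20,000 sequenced reads from a 2 Mbp genome (5x coverage with 500 bp reads).
print(arrival_log_odds(k=10, delta=1000, n_frags=20_000, genome_len=2_000_000))  # > 0: unique
print(arrival_log_odds(k=25, delta=1000, n_frags=20_000, genome_len=2_000_000))  # < 0: likely repeat
```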
The methods described so far are not fool-proof, leading to mis-assembled repeats
or to the failure to assemble the reads corresponding to large copy-number repeats
(due to the k-mer frequency thresholding method described above). Statistical ap-
proaches are generally poorly suited to detecting repeats with low copy numbers,
and are confused by skewed fragment distributions caused by
an imperfect shotgun process. The error correction techniques that rely on a multiple
alignment of reads to identify errors require a certain amount of coverage in order to
correctly distinguish the sequencing errors (generally, 4 reads are required to confi-
dently identify a distinguishing column). Unfortunately, certain classes of repetitive
sequences are under-represented in the shotgun libraries [37] thus escaping detection.
It is, therefore, important for assemblers to rely on additional sources of information
when identifying and correctly assembling repeats. The Euler assembler [60] iden-
tifies the effect of repeats on the structure of the SBH graph, a situation they call a
tangle (Fig. 14). Identifying such regions that cannot be unambiguously resolved by
the assembler allows the design of specific laboratory experiments meant to provide
the additional information needed for “untangling” the graph [46]. Other assemblers
make use of the “mate-pair” information by linking together reads from opposite
ends of the same fragment. The presence of conflicts in the mate-pair data is usually
a good indication for the existence of a repeat [18], even in the cases when statis-
tical tests are inconclusive. Mate-pairs are thus used to guide the assembly process
[5,8,105] or to identify and repair incorrectly assembled contigs [19,20,64]. Arachne
[19] identifies “weak” regions of the contigs, i.e., regions supported only by fragment
overlaps and not mate-pairs, then breaks such contigs in order to avoid the potential
mis-assembly.
Figure 21 highlights three common scenarios for mis-assemblies caused by re-
peats (represented in different shades of gray in the figure). For each mis-assembly
scenario we indicate in gray those mate-pair relationships that become invalidated.
These can be used as an indicator of mis-assembly. In the case of collapsed tandem
repeats (Fig. 21(a)) mate pairs linking distinct repeat copies become too short, or
force the reads to be incorrectly oriented with respect to each other. The situation
when a collapsed repeat forces the excision of a contig (Fig. 21(b)) leads to mate-
pair links connecting the middle of a contig with another one. Such situations are
handled by the “positive breaking” routine of Arachne. Finally, the rearrangement of
the genome around repeats (Fig. 21(c)) may lead to a lengthening of the mate-pairs
(as shown by the mate-pair a in the figure). This last example also shows one of the
possible pitfalls of using mate-pair data to guide the assembly process. The genome
can be mis-assembled in such a way as to preserve all the mate-pair relationships (as
shown by the links drawn in black). An assembler that uses mate-pairs to guide the
placement of reads may thus inadvertently re-arrange the genome without providing
any evidence of mis-assembly. Note that the last example is not a purely theoretical
one. Such a situation occurs in the assembly of bacteria where ribosomal RNA genes
(3–5 kbp repeats) commonly lead to such rearrangements.
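The mate-pair checks outlined above can be expressed compactly; the fragment below is a deliberately simplified illustration (not the routine used by any of the assemblers cited above) that flags a pair as suspicious when its reads have the wrong relative orientation or when the implied insert size falls too far from the library mean.

```python
def matepair_ok(pos_fwd, pos_rev, orient_fwd, orient_rev,
                mean_insert, std_insert, n_std=3):
    """Return True if a mate pair placed within one contig is consistent.

    pos_*   : leftmost coordinate of each read in the contig
    orient_*: '+' or '-' strand of each read
    The two reads must point toward each other ('+' before '-') and the
    implied insert size must lie within n_std standard deviations of the
    library mean (read lengths are ignored in this simplification).
    """
    # Correct relative orientation: forward read upstream of reverse read.
    if not (orient_fwd == '+' and orient_rev == '-' and pos_fwd < pos_rev):
        return False
    insert = pos_rev - pos_fwd
    return abs(insert - mean_insert) <= n_std * std_insert

# A collapsed tandem repeat typically compresses the insert:
print(matepair_ok(1000, 1800, '+', '-', mean_insert=3000, std_insert=300))  # False
print(matepair_ok(1000, 4100, '+', '-', mean_insert=3000, std_insert=300))  # True
```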
The computation of the consensus sequence of the DNA molecule corresponds to the well-studied problem of multi-
ple sequence alignment (see, for example, [87]). For a set of sequences, the goal is to
identify a “best” multiple alignment under a specified definition of alignment quality.
In the case of sequence assembly the objective function for the multiple alignment in-
volves the consensus sequence, a representation of the DNA being sequenced. This
problem requires finding a consensus sequence S such that the sum of the distances
(in terms of standard edit-distance measures) between all reads and S is minimized.
In the general case there are no known efficient methods to compute the consensus
sequence. In the case of shotgun sequence assembly, however, the low number of
errors in the sequencing reads implies that most heuristics lead to good solutions.
Most assembly algorithms follow an iterative approach to computing the consen-
sus sequence. Such algorithms start with one of the reads as the consensus, then
iteratively refine this consensus by adding the other reads, one by one, to the already
computed alignment. The multiple alignment problem is, in this fashion, reduced to
the computation of several pairwise alignments, a much simpler task. In the case of
algorithms following the greedy paradigm, the order of the addition of fragments is
naturally defined by the order in which the algorithm examines fragment overlaps. In
fact, in most greedy algorithms [3,5,64,79,90,92] the layout and consensus stages of
the assembler are combined by maintaining a correct consensus for each intermedi-
ate contig. This approach allows the pairwise alignment routine to take into account
the sequence of the already computed consensus. In addition to the consensus, TIGR
Assembler [5] also keeps track of the characteristics of the multiple alignment by
storing for each location in the consensus the profile [106] of all distinct bases that
align to that consensus location. The profile consists of a list of all distinct bases
occurring at that particular column, together with their multiplicities.
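As a concrete, deliberately simplified illustration of the profile idea, the fragment below builds a per-column profile from a gap-padded multiple alignment and calls the consensus by simple majority; the actual bookkeeping in TIGR Assembler is more involved.

```python
from collections import Counter

def profile_consensus(aligned_reads):
    """Compute a majority-rule consensus from gap-padded, equal-length reads.

    aligned_reads: list of strings over {A, C, G, T, -}; '-' marks a gap or
    an uncovered position.  Returns (consensus string, per-column profiles).
    """
    length = len(aligned_reads[0])
    consensus, profiles = [], []
    for col in range(length):
        # Profile: multiplicity of each distinct base aligned to this column.
        counts = Counter(read[col] for read in aligned_reads if read[col] != '-')
        profiles.append(dict(counts))
        consensus.append(counts.most_common(1)[0][0] if counts else '-')
    return ''.join(consensus), profiles

reads = ["ACGT-ACGT",
         "ACGTTACGT",
         "ACGTTACGA"]
cons, prof = profile_consensus(reads)
print(cons)      # ACGTTACGT
print(prof[4])   # {'T': 2}  (the first read has a gap at this column)
```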
Algorithms following the overlap-layout-consensus paradigm start the consen-
sus stage with a rough estimate of the location of each read in the final multiple
alignment. The consensus algorithm can use this information to guide the iterative
alignment procedure [8,18,58,91]. Phrap [91] uses the alignments between reads pro-
duced by the greedy stage to build a graph connecting the perfectly matching portions
of these alignments. The final consensus represents a mosaic of the reads obtained
from a maximum weight path in this graph. The algorithm used in AMASS [90]
relies on the low error rates in the fragments to identify columns of the multiple
alignments where all reads agree, then performs a multiple-alignment routine only
in those sections located between exact matches, leading to a very efficient algo-
rithm. Kececioglu and Myers [58] define the consensus computation as the maxi-
mum weight trace problem, by constructing a graph whose edges correspond to
the bases matched by the pairwise overlap stage of the algorithm, and requiring that
the order of the bases, as defined by each individual read, is preserved in the consen-
sus. This problem is NP-complete, thus they propose a sliding window heuristic for
computing the consensus.
These heuristic approaches often produce imperfect alignments, since the quality
of the alignment is affected by the order in which the fragments are added to the
alignment. Anson and Myers [107] propose an iterative algorithm for optimizing
the rough alignments produced by the consensus algorithm. Their approach removes
one read r at a time from the alignment A, then realigns r to A − {r} in an attempt to
decrease the number of errors. A similar technique was also used by Huang [92] as
part of the CAP assembler.
As a final step, most consensus algorithms attempt to determine the quality of the
consensus sequence. Such quality estimates are commonly produced for each read by
the base-calling software [96], and Churchill and Waterman [108] propose a statistical
model that combines these error probabilities into an estimate of consensus quality
in each column of the multiple alignment. Their algorithm also provides a method
for deciding on the specific base-call for each consensus base. Previously [2], the
consensus base was computed through a simple majority rule. Bonfield and Staden
[109] suggest a parametric approach to consensus computation that can be tuned to
take into account situations encountered in real data.
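The sketch below shows one simple way of combining per-read error probabilities into a consensus base call and column quality; it is our own illustration under independence assumptions and a uniform prior, not the exact model of Churchill and Waterman [108].

```python
import math

def column_call(observations):
    """Call a consensus base and quality for one alignment column.

    observations: list of (base, error_probability) pairs, where the error
    probability comes from the base caller (e.g., 10**(-Q/10) for a Phred
    quality of Q).  Bases are scored by summing log-probabilities under the
    simplifying assumptions that errors are independent and that a wrong
    call is one of the three alternative bases with equal chance.
    """
    scores = {}
    for candidate in "ACGT":
        logp = 0.0
        for base, p_err in observations:
            logp += math.log(1.0 - p_err) if base == candidate else math.log(p_err / 3.0)
        scores[candidate] = logp
    # Normalise to a posterior and express the error probability as a Phred score.
    total = sum(math.exp(s) for s in scores.values())
    best = max(scores, key=scores.get)
    p_error = 1.0 - math.exp(scores[best]) / total
    phred = -10.0 * math.log10(max(p_error, 1e-10))
    return best, phred

# Three reads agree (Q20 each), one disagrees (Q10):
print(column_call([('A', 0.01), ('A', 0.01), ('A', 0.01), ('C', 0.1)]))
```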
All these techniques for assessing the quality of the consensus sequence generated
by the assembler make the assumption that each column in the multiple alignment
corresponds to a unique base in the sequenced DNA strand. In other words, all the
bases in a column must be identical with the exception of differences caused by
sequencing errors. Columns that contain a mixture of bases are considered evidence
of mis-assembly and are therefore assigned a low quality. Such low-quality bases are
usually targeted as part of the finishing process [44,45]. In Section 5.1 we will discuss
a situation in which this basic assumption is not true, specifically the case when the
DNA being sequenced consists of a mixture of two or more highly similar molecules.
4.5 Scaffolding
With the exception of very simple data-sets, assembly programs are unable to cor-
rectly reconstruct the genome as a single contig. The output of an assembler usually
consists of a collection of contigs whose placement along the genome is unknown.
There are three main reasons why an assembler cannot join together
all the reads into a single contig:
• The random sampling of DNA fragments from the genome naturally leads to
certain sections not being sampled by any sequencing reads. Even at the levels of
coverage selected for typical sequencing projects (5–10×), the probability of
observing such gaps is relatively high, especially when considering that short
overlaps between reads (usually less than about 40 bp) cannot be reliably de-
tected by the assembly software.
• Certain regions of the genome are poorly represented in the fragment libraries
due to purely biological reasons.
• The assembler may not be able to correctly assemble repeats leading to portions
of the genome that either remain un-assembled, or are assembled incorrectly.
Since the ultimate goal of assembly is to reconstruct, as much as possible, the original
structure of the genome, scientists have developed techniques meant to identify the
correct placement of these contigs along the genome. One such technique, called
scaffolding [34], orders and orients the contigs with respect to each other using the
information contained in the pairing of reads sequenced from opposite ends of a
fragment (see Fig. 5). This technique was used for the first time to guide the assembly
and finishing of Haemophilus influenzae [4], leading to the first complete sequence
of a free-living organism.
In abstract terms, the mate-pair relationship between reads implies a linking of
two contigs with respect to their relative order and orientation. Scaffolding can thus
be extended to take into account other sources of information defining a particular
relative placement of the contigs. Some such sources of information are:
• contig overlaps—ideally the output of an assembler should consist of a collec-
tion of non-overlapping contigs. Sequencing errors, often situated at the ends of
reads, lead to contigs that cannot be merged by the assembler. These overlaps
can be identified by less stringent alignment algorithms and provide valuable
scaffolding information.
• physical maps—for many genomes scientists map the locations of known
markers along the genome. This is, for example, an essential step in a hierarchi-
cal BAC-by-BAC sequencing project. The location of these markers in the as-
sembled contigs provides information useful in anchoring the contigs to known
positions along the genome [110].
• alignments to a related genome—an increasing number of finished genomes
are becoming available to the scientific community. Thus, for many organisms it
is possible to obtain the complete sequence of a closely related organism. The
alignment of the contigs to this reference can thus be used in those cases when
physical maps are not available. This information should, however, be used with
care since genomic rearrangements may lead to an incorrect reconstruction of
the genome.
• gene synteny data—in many organisms certain genes occur in clusters. This
information can be used for scaffolding by identifying, for example, pairs of
contigs that contain genes belonging to the same cluster. While the orientation of
the contigs cannot generally be determined, their spatial proximity is useful
information for scaffolding.
Please note that some sources of linking data are inherently erroneous and may only
provide an approximate estimate of contig adjacency. As an example, physical maps
provide only a coarse level of detail as the distances between markers can only be
approximately determined. Similarly, gene synteny data provides little information
about the distance between contigs.
This variety of sources of information can be used to infer the relative placement
of two contigs, yielding a set of abstract links between adjacent contigs. Each link
defines one or more of the following constraints on the relative placement of the two
contigs:
• ordering—one of the contigs occurs “before” the other in the sense of a linear
or circular order corresponding to the location along a linear or circular chro-
mosome;
• orientation—the specification of whether the two contigs represent samples
from the same or from opposite strands of the double-stranded DNA;
• spacing—an indication of the distance between the two contigs.
The scaffolding problem can, therefore, be defined as:
Given a set of contigs and a set of pairwise constraints, identify a linear (circular
in the case of most bacterial genomes) embedding (defining both the order and
the orientation of the contigs along a chromosome) of these contigs such that
most constraints are satisfied.
In the general case this problem is intractable [66,111]. Interestingly, even when re-
laxing the problem by ignoring the ordering of the contigs, the associated contig
orientation problem is also intractable [58], as is the complementary problem of or-
dering the contigs when a proper orientation is given. The orientation problem is
equivalent to finding a maximum bipartite sub-graph, while the ordering problem
is similar to the Optimal Linear Arrangement problem, both of which are NP-hard
[50]. Kececioglu and Myers [58] describe a greedy approximation algorithm to the
orientation problem in the context of sequence assembly that achieves a solution
within a factor of two of the optimal.
All constraints defined above can be described in terms of linear inequalities and,
therefore, the scaffolding problem can be formulated as a constraint satisfiability
problem [19,112]. Due to the complexity of solving such problems (typical solutions
involve many iterations of complex relaxation steps) practical implementations of
this approach are limited to local optimization steps within the scaffolder [8,19,66].
As an example, the scaffolder used by Celera Assembler [8] refines the placement
of the contigs by attempting to minimize the “stretch” of the mate-pair relationships
as defined by the sum of the squares of deviations from the mean fragment size. In
many cases this restricted problem can be easily solved as it reduces to a system of
linear equations.
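When the order and orientation of the contigs are already fixed, minimizing the squared "stretch" of the mate pairs reduces to linear least squares; the following numpy sketch illustrates this reduction. It is a minimal illustration, not the Celera Assembler code, and the link format is a hypothetical simplification.

```python
import numpy as np

def place_contigs(n_contigs, links):
    """Least-squares placement of contigs along a scaffold.

    links: list of (i, j, d) meaning a mate pair implies that the offset of
    contig j should exceed that of contig i by roughly d bases (d is derived
    from the mean library insert size and the read positions inside the
    contigs).  Contig 0 is pinned at offset 0 to remove the translational
    degree of freedom.
    """
    rows, rhs = [], []
    for i, j, d in links:
        row = np.zeros(n_contigs)
        row[j], row[i] = 1.0, -1.0      # encodes x_j - x_i ~= d
        rows.append(row)
        rhs.append(d)
    # Pin contig 0 at the origin with a heavily weighted equation.
    anchor = np.zeros(n_contigs)
    anchor[0] = 1.0
    rows.append(1e6 * anchor)
    rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    offsets, *_ = np.linalg.lstsq(A, b, rcond=None)
    return offsets

# Three contigs, several slightly inconsistent mate-pair estimates:
print(place_contigs(3, [(0, 1, 5000), (1, 2, 4000), (0, 2, 9200), (0, 1, 5100)]))
```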
Most practical solutions to the scaffolding problem use a graph-theoretical ap-
proach. With one exception, the Eulerian graph approach of Pevzner et al. [61], all
scaffolding algorithms to date construct a graph whose nodes correspond to contigs
and whose edges correspond to the presence of contig links between the correspond-
ing contigs. In order to reduce the effect of errors, scaffolders require at least two links
between adjacent contigs. They then “bundle” all links between adjacent contigs into
a single contig edge and greedily join the contigs into scaffolds. The path-merging
algorithm of Huson et al. [111] examines the edges in decreasing order of the number
of links in the bundle. Whenever an edge links two distinct scaffolds, the algorithm
attempts to merge the scaffolds together (hence the name: path-merging). Arachne
[18,19] uses edge weights that depend on both the number of links and the size of
the edge, and Phusion [20] examines edges in order of their lengths, from smallest
to largest. The Bambus scaffolder [113] allows the user to specify the order in which
links are considered in terms of both library size and edge weight. Arachne [19] and
Jazz [22] incorporate an iterative error-correction step during which scaffolds may be
broken then re-combined based on links that were not used during the original greedy
step. Note that Bambus is the only scaffolder that can currently use all the types of
linking information described at the beginning of this section. All other scaffolders
use only the mate-pair information.
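The greedy step shared by these scaffolders can be sketched as follows; this is a simplified illustration of the path-merging idea (ignoring orientation and distance bookkeeping), not the implementation of any of the scaffolders cited above.

```python
from collections import Counter

def greedy_scaffold(links, min_links=2):
    """Greedily group contigs into scaffolds from mate-pair links.

    links: iterable of (contig_a, contig_b) pairs, one per mate pair.
    Bundles with fewer than min_links supporting pairs are ignored.
    Returns a mapping contig -> scaffold representative.
    """
    bundles = Counter(tuple(sorted(pair)) for pair in links)
    parent = {}

    def find(x):                      # union-find with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Strongest bundles first, as in greedy path merging.
    for (a, b), weight in bundles.most_common():
        if weight < min_links:
            break
        ra, rb = find(a), find(b)
        if ra != rb:                  # merge the two scaffolds
            parent[rb] = ra
    return {c: find(c) for c in parent}

print(greedy_scaffold([("c1", "c2"), ("c1", "c2"), ("c2", "c3"),
                       ("c2", "c3"), ("c3", "c4")]))
```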
When using real data it is difficult to know the correct reconstruction of the
genome; in particular, it is generally impossible to know the correct placement of
all reads along the genome. The validation of assemblies based on real shotgun se-
quence data requires one or more of the following types of information:
• an independently verified consensus sequence,
• knowledge of the location of experimentally identified markers,
• the assembly output from a different assembly program,
• high-quality mate-pairing data.
The first two classes of information are useful in the testing of assembly algorithms
during development and in comparing different assembly algorithms. The latter two
classes of information can be used to validate the results of ab-initio sequencing
projects, and are thus most useful in the day-to-day use of assembly programs.
When a correct consensus sequence is available, whether artificially generated or
from a finished genome, assessing the quality of an assembly is relatively straight-
forward. Assembly errors manifest themselves as contigs that do not align perfectly
to the consensus (for some examples see Fig. 22). If the reference represents the
sequence of a finished genome, it is important to note that the finished sequence
itself may be incorrect.
Certain genetic markers, such as sequence tagged sites (STS)12 [67], can be used to
validate the global structure of the assembly when no finished reference is available.
The sequences of the markers are known, thus their locations within the assembly can
be easily identified. This in silico map of the markers is then compared to the map
obtained through laboratory experiments in order to validate the overall structure of
the assembly [110]. Such an approach was used by both Celera [9] and the Human
12 sequence tagged sites (STS)—short (200–500 bp) sequences of DNA that occur at a single place within
the genome and whose location in the genome can be mapped experimentally. They serve as landmarks
along the genome.
Genome Consortium [120] to ascertain the quality of their assemblies of the human
genome. This method provides information only about the large-scale structure of
the assembly and is not able to identify small mis-assemblies. Furthermore, physical
maps often contain errors and should not be relied upon entirely.
The comparison of different assemblies (either different algorithms/programs or
different algorithm parameters) is difficult to interpret. Without additional informa-
tion it is generally difficult to identify which assembly is incorrect, though the com-
parison is an important first step in identifying regions of the genome that require
further investigation. It is important to note that overall assembly statistics (such as
average and maximum contig and scaffold sizes) are not appropriate measures of
assembly quality. Large contig sizes can easily be achieved at the expense of mis-
assemblies. One should therefore ignore such statistics if not accompanied by an
assessment of the correctness of the assembly.
The validation of assemblies is possible even without independent certificates,
such as mapping data or completed genome sequence. Several characteristics of the
shotgun-sequencing process can be used to detect possible mis-assemblies. In Sec-
tion 4.2 we described how deep fragment coverage can be an indicator of the collapse
of repeats. Conversely, low coverage regions may indicate a mis-assembly due to a
short repeat. These approaches identify deviations from the expected distribution of
fragments produced by a purely random shotgun sequencing process, and are, there-
fore, sensitive to the problems caused by biases in the initial distribution. A more
reliable source of information is the mate-pair relationships between reads. This in-
formation was also recently proposed as a sensitive method of identifying structural
rearrangements in the human genome that are related to various cancers [121]. Note,
however, that many assembly algorithms use the mate-pair information to guide the
assembly process, thus rendering such information of limited use for validation.
An excellent summary of all these methods of assembly validation is presented
by Huson et al. [117] in the context of comparing two assemblies of the same set
of reads. They identify not only tell-tale signs of misassemblies, but also propose
expressive visualization schemes that allow the inspection of large assemblies.
5. Exotic Assembly
Up to this point we have presented solutions to the most common problems re-
lated to shotgun sequence assembly. These algorithms contributed to the current ge-
nomic revolution leading to an exponentially increasing number of genomes being
sequenced. This increase in the numbers and types of genomes that are analyzed is
uncovering new problems to be solved by assembly programs. In this section we will
briefly discuss a few of the current assembly challenges.
and the five bases surrounding it on either side be of high quality. They show that
this simple approach greatly reduced the number of false positives. Similarly, the
use of base qualities is suggested by Read et al. [126] in the context of identifying
SNPs between two highly similar strains of Bacillus anthracis, the bacterium caus-
ing anthrax. Their technique provides an estimate of the probability of each SNP
being correct by computing the consensus quality for each of the two variants, then
choosing the lower value as the quality of the SNP.
It is often important not only to identify the polymorphic sites, but also to determine
which sites belong to the same chromosome, a process called haplotyping. In Sec-
tion 4.2 we discussed how correlated differences between reads can provide enough
information to separate out different copies of a repeat. The same techniques can be
used to separate the two distinct chromosomes, though, in general, the data can only
be partially separated as long regions of similarity between the haplotypes break the
connection between consecutive SNPs. Lancia et al. [127] define this problem in
terms of fragment conflict graphs. This graph represents each read as a node, and
connects two nodes if the corresponding reads disagree at one or more SNP sites.
In the absence of sequencing errors, the fragment conflict graph is bipartite, corre-
sponding to the two haplotypes. In the presence of sequencing errors the graph can
be transformed into a bipartite graph (thereby separating the haplotypes) by either
removing a set of reads, or removing a set of SNPs (effectively marking them as
sequencing errors). Thus they define three optimization problems:
Minimum fragment removal—remove the minimum number of fragments such
that the remaining conflict graph is bipartite.
Minimum SNP removal—remove the minimum number of SNP sites such that the
remaining graph is bipartite.
Longest haplotype reconstruction—remove a set of fragments such that the sum
of the lengths of the derived haplotypes is maximized.
They proceed to show that all these problems are NP-hard in the general case; how-
ever, they can be efficiently solved in the case when the reads do not contain any
gaps, a situation often encountered in practice. Lippert et al. [128] further extend
the results of Lancia et al. by examining the complexities introduced by the possibil-
ity of multiple optimal solutions to the problems described above. Thus they suggest
the need for a better formulation of the separation problem that does not depend on
the choice of an optimal solution from among a number of alternatives.
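The conflict-graph formulation lends itself to a short illustration: the sketch below builds the graph from SNP calls and tests bipartiteness by two-coloring, in which case the two color classes correspond to the two haplotypes. This is our own illustration of the formulation, not the algorithms of Lancia et al. [127], and the input format is hypothetical.

```python
from collections import deque

def haplotype_partition(reads):
    """Attempt to split reads into two haplotypes via the conflict graph.

    reads: dict read_id -> dict {snp_site: allele}.  Two reads conflict if
    they disagree at any SNP site they both cover.  Returns a dict
    read_id -> 0/1 if the conflict graph is bipartite, otherwise None
    (indicating sequencing errors or more than two haplotypes).
    """
    ids = list(reads)
    conflict = {r: set() for r in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = reads[a].keys() & reads[b].keys()
            if any(reads[a][s] != reads[b][s] for s in shared):
                conflict[a].add(b)
                conflict[b].add(a)
    color = {}
    for start in ids:                      # two-color each connected component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in conflict[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None            # odd cycle: not bipartite
    return color

reads = {"r1": {1: "A", 2: "C"}, "r2": {1: "G", 2: "T"},
         "r3": {2: "C", 3: "G"}, "r4": {3: "T"}}
print(haplotype_partition(reads))          # {'r1': 0, 'r2': 1, 'r3': 0, 'r4': 1}
```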
It is important to discuss the effect of haplotypes on the assembly process itself.
Sequencing errors may lead to inconsistencies in the structure of the read overlaps.
Such inconsistencies are only accentuated by the presence of distinct haplotypes in
the shotgun data, leading to characteristic bubbles (see Fig. 24) in the overlap graph.
Fasulo et al. [129] describe the algorithms used by Celera Assembler that allow the
FIG. 24. Characteristic "bubble" in the overlap graph (bottom) caused by a SNP. Reads B and C differ
at the SNP location leading to the absence of an overlap edge between them.
assembly of divergent haplotypes into a single consensus sequence. Note that rather
than attempting to separate the different haplotypes, they collapse them into a single
multiple alignment, leaving the task of haplotype separation to specialized tools such
as those described above.
Currently there are no widely accepted ways of analyzing and representing haplo-
types other than SNPs. This area of research is of great importance as more organ-
isms are being assembled and we encounter the need to understand complex poly-
morphism events.
number of gaps in the resulting sequence, and to allow the creation of large scaffolds.
However, when a reference genome sequence is available it is conceivable to perform
a considerably lower amount of sequencing (3–5 times over-sampling). The goal in
this case is to generate contigs large enough to allow an unambiguous mapping to
the reference sequence. Not only is the scaffolding problem not an issue in this case,
but also the remaining gaps in the sequence can be quickly closed through direct
sequencing experiments since their location is uniquely identified by the mapping
to the reference. The costs of sequencing a genome can, thus, be greatly reduced,
making feasible the survey sequencing of multiple individuals from the same strain.
To our knowledge, no assembly algorithm fully exploits the comparative informa-
tion, though a few make limited use of it.
relative distribution of bacteria cannot generally be modeled. The uneven level of
coverage also limits the use of standard procedures for repeat and haplotype separa-
tion since these techniques generally require sufficient sequence coverage from all
organisms in order to be effective. New algorithms are therefore necessary to handle
the correct assembly of mixed populations. Comparative assembly techniques will
also be extremely valuable in making use of the sequence data obtained from the
poorly represented members of the population.
6. Conclusions
The assembly problem has repeatedly been considered solved: first when efficient ap-
proximation algorithms for the shortest superstring problem became available, again
when assembly software was able to routinely assemble entire bacterial genomes,
and most recently when software became able to assemble entire mammalian genomes in
a relatively short time. Continued reductions in sequencing costs have led to a dra-
matic increase in the numbers of genomes being sequenced. A direct effect of this
genomic revolution is the uncovering of novel uses for assembly programs, leading
to new algorithmic challenges such as those discussed in Section 5. We therefore
hope that this survey will provide an adequate starting point for those interested in
further exploring the algorithmic and practical problems arising in this dynamic field.
ACKNOWLEDGEMENTS
I would like to thank Art Delcher, Adam Phillippy, and Steven Salzberg for their
useful comments and continued support. This work was supported in part by the
National Institutes of Health under grant R01-LM06845.
REFERENCES
[1] Sanger F., et al., “Nucleotide sequence of bacteriophage lambda DNA”, J. Mol.
Biol. 162 (4) (1982) 729–773.
[2] Staden R., “Automation of the computer handling of gel reading data produced by the
shotgun method of DNA sequencing”, Nucleic Acids Res. 10 (1982) 4731–4751.
[3] Gingeras T.R., et al., “Computer programs for the assembly of DNA sequences”, Nu-
cleic Acids Res. 7 (2) (1979) 529–545.
[4] Fleischmann R.D., et al., “Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd”, Science 269 (5223) (1995) 496–512.
[5] Sutton G.G., et al., “TIGR assembler: A new tool for assembling large shotgun se-
quencing projects”, Genome Science and Technology 1 (1995) 9–19.
[6] Green P., “Against a whole-genome shotgun”, Genome Res. 7 (5) (1997) 410–417.
[7] Adams M.D., et al., “The genome sequence of Drosophila melanogaster”, Sci-
ence 287 (5461) (2000) 2185–2195.
[8] Myers E.W., et al., “A whole-genome assembly of Drosophila”, Science 287 (5461)
(2000) 2196–2204.
[9] Venter J.C., et al., “The sequence of the human genome”, Science 291 (5507) (2001)
1304–1351.
[10] International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome", Na-
ture 409 (2001) 860–921.
[11] Semple C.A., et al., “Computational comparison of human genomic sequence assem-
blies for a region of chromosome 4”, Genome Res. 12 (3) (2002) 424–429.
[12] Aach J., et al., “Computational comparison of two draft sequences of the human
genome”, Nature 409 (6822) (2001) 856–859.
[13] Adams M.D., et al., “The independence of our genome assemblies”, Proc. Natl. Acad.
Sci. USA 100 (6) (2003) 3025–3026.
[14] Waterston R.H., Lander E.S., Sulston J.E., “More on the sequencing of the human
genome”, Proc. Natl. Acad. Sci. USA 100 (6) (2003) 3022–3024, author reply 3025–
3026.
[15] Waterston R.H., Lander E.S., Sulston J.E., “On the sequencing of the human genome”,
Proc. Natl. Acad. Sci. USA 99 (6) (2002) 3712–3716.
[16] Green P., “Whole-genome disassembly”, Proc. Natl. Acad. Sci. USA 99 (7) (2002)
4143–4144.
[17] Myers E.W., et al., “On the sequencing and assembly of the human genome”, Proc.
Natl. Acad. Sci. USA 99 (7) (2002) 4145–4146.
[18] Batzoglou S., et al., “ARACHNE: a whole-genome shotgun assembler”, Genome
Res. 12 (1) (2002) 177–189.
[19] Jaffe D.B., et al., “Whole-genome sequence assembly for Mammalian genomes:
arachne 2”, Genome Res. 13 (1) (2003) 91–96.
[20] Mullikin J.C., Ning Z., “The phusion assembler”, Genome Res. 13 (1) (2003) 81–90.
[21] Havlak P., et al., “The Atlas whole-genome assembler”, in: Currents in Computational
Molecular Biology, Montreal, Canada, 2001.
[22] Aparicio S., et al., “Whole-genome shotgun assembly and analysis of the genome of
Fugu rubripes”, Science 297 (5585) (2002) 1301–1310.
[23] Waterston R.H., et al., “Initial sequencing and comparative analysis of the mouse
genome”, Nature 420 (6915) (2002) 520–562.
[24] Rat Genome Sequencing Consortium, "Rat genome project", http://www.hgsc.bcm.tmc.edu/projects/rat.
[25] Kirkness E.F., et al., “The dog genome: survey sequencing and comparative analysis”,
Science 301 (5641) (2003) 1898–1903.
[26] Dehal P., et al., “The draft genome of Ciona intestinalis: insights into chordate and
vertebrate origins”, Science 298 (5601) (2002) 2157–2167.
[27] Green E.D., “Strategies for the systematic sequencing of complex genomes”, Nat. Rev.
Genet. 2 (8) (2001) 573–583.
[28] Cai W.W., et al., “A clone-array pooled shotgun strategy for sequencing large
genomes”, Genome Res. 11 (10) (2001) 1619–1623.
[29] Lander E.S., Waterman M.S., “Genomic mapping by fingerprinting random clones:
A mathematical analysis”, Genomics 2 (3) (1988) 231–239.
[30] Czabarka E., et al., “Algorithms for optimizing production DNA sequencing”, in: Pro-
ceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2000.
[31] Batzoglou S., et al., “Sequencing a genome by walking with clone-end sequences:
A mathematical analysis”, Genome Res. 9 (1999) 1163–1174.
[32] Li X., Waterman M.S., "Estimating the repeat structure and length of DNA sequences
using ℓ-tuples", Genome Res. 13 (8) (2003) 1916–1922.
[33] Arratia R., et al., “Genomic mapping by anchoring random clones: A mathematical
analysis”, Genomics 11 (4) (1991) 806–827.
[34] Roach J.C., et al., “Pairwise end sequencing: A unified approach to genomic mapping
and sequencing”, Genomics 26 (1995) 345–353.
[35] Port E., et al., “Genomic mapping by end-characterized random clones: A mathematical
analysis”, Genomics 26 (1) (1995) 84–100.
[36] Yeh R.F., et al., Predicting Progress in Shotgun Sequencing with Paired Ends, 2002.
[37] Chissoe S.L., et al., “Representation of cloned genomic sequences in two sequenc-
ing vectors: correlation of DNA sequence and subclone distribution”, Nucleic Acids
Res. 25 (15) (1997) 2960–2966.
[38] Mullis K., et al., “Specific enzymatic amplification of DNA in vitro: the polymerase
chain reaction”, Cold Spring Harb. Symp. Quant. Biol. 51 Pt 1 (1986) 263–273.
[39] Burgart L.J., et al., “Multiplex polymerase chain reaction”, Mod. Pathol. 5 (3) (1992)
320–323.
[40] Tettelin H., et al., “Optimized multiplex PCR: efficiently closing a whole-genome shot-
gun sequencing project”, Genomics 62 (3) (1999) 500–507.
[41] Beigel R., et al., “An optimal procedure for gap closing in whole genome shotgun se-
quencing”, in: Proceedings of the Fifth Annual International Conference on Computa-
tional Biology (RECOMB), 2001.
[42] Alon N., et al., “Learning a hidden matching”, in: Proceedings of the IEEE symposium
on Foundations of Computer Science (FOCS), 2002.
[43] Staden R., Judge D.P., Bonfield J.K., “Sequence assembly and finishing methods”,
Methods Biochem. Anal. 43 (2001) 303–322.
[44] Gordon D., Desmarais C., Green P., “Automated finishing with autofinish”, Genome
Res. 11 (4) (2001) 614–625.
[45] Gordon D., Abajian C., Green P., “Consed: A graphical tool for sequence finishing”,
Genome Res. 8 (1998) 195–202.
[46] Mulyukov Z., Pevzner P.A., “EULER-PCR: finishing experiments for repeat resolu-
tion”, Pac. Symp. Biocomput. (2002) 199–210.
[47] Toth G., Gaspari Z., Jurka J., “Microsatellites in different eukaryotic genomes: survey
and analysis”, Genome Res. 10 (7) (2000) 967–981.
[48] Myers E.W., “Toward simplifying and accurately formulating fragment assembly”,
J. Comp. Bio. 2 (2) (1995) 275–290.
[49] HUGO, "Summary of the report of the second international strategy meeting on human
genome sequencing", http://www.gene.ucl.ac.uk/hugo/bermuda2.html.
[50] Garey M.R., Johnson D.S., Computers and Intractability, W.H. Freeman, New York,
1979.
[51] Blum A., et al., “Linear approximation of shortest superstrings”, in: Proc. 23rd Annual
Symposium on the Theory of Computing, New Orleans, LA, 1991.
[52] Peltola H., et al., “Algorithms for some string matching problems arising in molecular
genetics”, in: Proc. Information Processing, 1983.
[53] Teng S., Yao F.F., “Approximating shortest superstrings”, SIAM J. Computing 26 (2)
(1997) 410–417.
[54] Armen C., Stein C., “A 2 2/3-approximation algorithm for the shortest superstring prob-
lem”, in: Proc. Combinatorial Pattern Matching, 1996.
[55] Kececioglu J.D., Exact and Approximation Algorithms for DNA Sequence Reconstruc-
tion, University of Arizona, 1991.
[56] Smith T.F., Waterman M.S., “Identification of common molecular subsequences”,
J. Mol. Biol. 147 (1) (1981) 195–197.
[57] Peltola H., Soderlund H., Ukkonen E., “SEQAID: a DNA sequence assembling pro-
gram based on a mathematical model”, Nucleic Acids Res. 12 (1) (1984) 307–321.
[58] Kececioglu J.D., Myers E.W., “Combinatorial algorithms for DNA sequence assem-
bly”, Algorithmica 13 (1995) 7–51.
[59] Idury R.M., Waterman M.S., “A new algorithm for DNA sequence assembly”, J. Comp.
Bio. 2 (2) (1995) 291–306.
[60] Pevzner P.A., Tang H., Waterman M.S., “An Eulerian path approach to DNA fragment
assembly”, Proc. Natl. Acad. Sci. USA 98 (17) (2001) 9748–9753.
[61] Pevzner P.A., Tang H., “Fragment assembly with double-barreled data”, Bioinformat-
ics 17 (Suppl 1) (2001) S225–S233.
[62] Pevzner P.A., Tang H., Waterman M.S., “A new approach to fragment assembly in
DNA sequencing”, in: Proceedings of the Fifth Annual International Conference on
Computational Biology (RECOMB), 2001.
[63] Pop M., Salzberg S.L., Shumway M., “Genome sequence assembly: algorithms and
issues”, IEEE Computer 35 (7) (2002) 47–54.
[64] Huang X., Madan A., “CAP3: A DNA sequence assembly program”, Genome Res. 9
(1999) 868–877.
[65] Bevan M., et al., “Sequence and analysis of the Arabidopsis genome”, Curr. Opin.
Plant. Biol. 4 (2) (2001) 105–110.
[66] Kent W.J., Haussler D., “Assembly of the working draft of the human genome with
GigAssembler”, Genome Res. 11 (9) (2001) 1541–1548.
[67] Olson M., et al., “A common language for physical mapping of the human genome”,
Science 245 (4925) (1989) 1434–1435.
[68] Wang J., et al., “RePS: a sequence assembler that masks exact repeats identified from
the shotgun data”, Genome Res. 12 (5) (2002) 824–831.
[69] Huson D.H., et al., “Design of a compartmentalized shotgun assembler for the human
genome”, Bioinformatics 17 (Suppl 1) (2001) S132–S139.
[70] Csuros M., Milosavljevic A., “Pooled genomic indexing (PGI) mathematical analysis
and experiment design”, in: Proceedings of the 2nd International Workshop on Algo-
rithms in Bioinformatics (WABI), Springer-Verlag, 2002.
[71] Parsons R.J., Forrest S., Burks C., “Genetic algorithms, operators, and DNA fragment
assembly”, Machine Learning 21 (1995) 11–33.
[72] Goldberg M.K., Lim D.T., “A learning algorithm for the shortest superstring problem”,
in: Proceedings of the Atlantic Symposium on Computational Biology and Genome
Information Systems and Technology, Durham, NC, 2001.
[73] Goldberg M.K., Lim D.T., “Designing and testing a new DNA fragment assembler
VEDA-2”, http://www.cs.rpi.edu/~goldberg/publications/veda-2.pdf.
[74] Jiang T., Li M., “DNA sequencing and string learning”, Math. Sys. Theory 29 (1996)
387–405.
[75] King L.M., Cummings M.P., “Satellite DNA repeat sequence variation is low in three
species of burying beetles in the genus Nicrophorus (Coleoptera: Silphidae)”, Mol. Biol.
Evol. 14 (11) (1997) 1088–1095.
[76] Kosaraju S.R., Delcher A., "Large-scale assembly of DNA strings and space-efficient
construction of suffix trees (Correction)", in: Proceedings of the 28th Annual ACM Sym-
posium on Theory of Computing, STOC’96, 1996.
[77] Kosaraju S.R., Delcher A.L., “Large-scale assembly of DNA strings and space-efficient
construction of suffix trees”, in: Proceedings of the ACM Symposium on the Theory of
Computing, STOC’95, 1995.
[78] Chen T., Skiena S.S., “Trie-based data structures for sequence assembly”, in: Proceed-
ings of the Eighth Symposium on Combinatorial Pattern Matching, 1997.
[79] Chen T., Skiena S.S., “A case study in genome-level fragment assembly”, Bioinformat-
ics 16 (2000) 494–500.
[80] Weiner P., “Linear pattern matching algorithms”, in: Proceedings of the 14th IEEE
Symposium on Switching and Automata Theory, 1973.
[81] Manber U., Myers E.W., “Suffix arrays: A new method for on-line string searches”,
SIAM J. Computing 22 (1993) 935–948.
[82] Ukkonen E., “On-line construction of suffix-trees”, Algorithmica 14 (1995) 249–260.
[83] McCreight E.M., “A space-economical suffix tree construction algorithm”, J. ACM 23
(1976) 262–272.
[84] Needleman S.B., Wunsch C.D., “A general method applicable to the search for similar-
ities in the amino acid sequence of two proteins”, J. Mol. Biol. 48 (1970) 443–453.
[85] Myers E.W., Miller W., “Optimal alignments in linear space”, CABIOS 4 (1988) 11–17.
[86] Myers E.W., “An O(nd) difference algorithm and its variations”, Algorithmica 1 (1986)
251–266.
[87] Gusfield D., Algorithms on Strings, Trees, and Sequences, The Press Syndicate of the
University of Cambridge, 1997.
[88] Altschul S.F., et al., “Basic local alignment search tool”, J. Mol. Biol. 215 (1990) 403–
410.
[89] Pearson W.R., Lipman D.J., “Improved tools for biological sequence comparison”,
Proc. Natl. Acad. Sci. USA 85 (1988) 2444–2448.
[90] Kim S., Segre A.M., “AMASS: A structured pattern matching approach to shotgun
sequence assembly”, J. Comp. Bio. 6 (2) (1999) 163–186.
[91] Green P., PHRAP documentation: ALGORITHMS, 1994.
[92] Huang X., “An improved sequence assembly program”, Genomics 33 (1996) 21–31.
[93] Tammi M.T., et al., “Correcting errors in shotgun sequences”, Nucleic Acids
Res. 31 (15) (2003) 4663–4672.
[94] Roberts M., Hunt B.R., Yorke J.A., Bolanos R., Delcher A., “A preprocessor for shot-
gun assembly of large genomes”, J. Comp. Bio., submitted for publication.
[95] Huang X., et al., “PCAP: A whole-genome assembly program”, Genome Res. 13 (9)
(2003) 2164–2170.
[96] Ewing B., Green P., “Base-calling of automated sequencer traces using phred. II. Error
probabilities”, Genome Res. 8 (3) (1998) 186–194.
[97] Ewing B., et al., “Base-calling of automated sequencer traces using phred. I. Accuracy
assessment”, Genome Res. 8 (3) (1998) 175–185.
[98] Chou H.H., Holmes M.H., “DNA sequence quality trimming and vector removal”,
Bioinformatics 17 (12) (2001) 1093–1104.
[99] Kececioglu J., Yu J., “Separating repeats in DNA sequence assembly”, in: Proceedings
of the Fifth Annual International Conference on Computational Biology (RECOMB),
Montreal, Canada, 2001.
[100] Myers E.W., “Optimally separating sequences”, Genome Informatics 12 (2001) 165–
174.
[101] Tammi M.T., et al., “Separation of nearly identical repeats in shotgun assemblies using
defined nucleotide positions, DNPs”, Bioinformatics 18 (3) (2002) 379–388.
[102] Tammi M.T., Arner E., Andersson B., “TRAP: Tandem Repeat Assembly Program pro-
duces improved shotgun assemblies of repetitive sequences”, Comput. Methods Pro-
grams Biomed. 70 (1) (2003) 47–59.
[103] Pe’er I., Shamir R., “Spectrum alignment: Efficient resequencing by hybridization”, in:
Proceedings of the Eighth International Conference on Intelligent Systems for Molecu-
lar Biology (ISMB), 2000.
[104] Bailey J.A., et al., “Segmental duplications: organization and impact within the current
human genome project assembly”, Genome Res. 11 (2001) 1005–1017.
[105] Pevzner P., Tang H., “Fragment assembly with double barreled data”, in: ISMB’01,
2001.
[106] Gribskov M., McLachlan A.D., Eisenberg D., “Profile analysis: Detection of distantly
related proteins”, Proc. Natl. Acad. Sci. USA 84 (1987) 4355–4358.
[107] Anson E.L., Myers E.W., “ReAligner: A program for refining DNA sequence multi-
alignments”, in: RECOMB’97, 1997.
[108] Churchill G.A., Waterman M.S., “The accuracy of DNA sequences: estimating se-
quence quality”, Genomics 14 (1) (1992) 89–98.
[109] Bonfield J.K., Staden R., “The application of numerical estimates of base calling accu-
racy to DNA sequencing projects”, Nucleic Acids Res. 23 (8) (1995) 1406–1410.
[110] Zhou S., et al., "Whole-genome shotgun optical mapping of Rhodobacter sphaeroides
strain 2.4.1 and its use for whole-genome shotgun sequence assembly”, Genome
Res. 13 (9) (2003) 2142–2151.
[111] Huson D.H., Reinert K., Myers E., “The greedy path-merging algorithm for sequence
assembly”, in: Proceedings of the Fifth Annual International Conference on Computa-
tional Biology (RECOMB), 2001.
[112] Thayer E.C., Olson M.V., Karp R.M., “Error checking and graphical representation
of multiple-complete-digest (MCD) restriction-fragment maps”, Genome Res. 9 (1)
(1999) 79–90.
[113] Pop M., Kosack D., Salzberg S.L., “Hierarchical scaffolding with bambus”, Genome
Res. 14 (1) (2004) 149–159.
[114] Kim S., Liao L., Tomb J.F., “A probabilistic approach to sequence assembly validation”,
in: Workshop on Data Mining in Bioinformatics, 2001.
[115] Seto D., Koop B.F., Hood L., “An experimentally derived data set constructed for test-
ing large-scale DNA sequence assembly algorithms”, Genomics 15 (1993) 673–676.
[116] Miller M.J., Powell J.I., “A quantitative comparison of DNA sequence assembly pro-
grams”, J. Comp. Bio. 1 (1994) 257–269.
[117] Huson D.H., et al., “Comparing assemblies using fragments and mate-pairs”, in: Work-
shop on Algorithms in Bioinformatics, Springer-Verlag, 2001.
[118] Engle M.L., Burks C., “Artificially generated data sets for testing DNA sequence as-
sembly algorithms”, Genomics 16 (1993) 286–288.
[119] Myers G., “A dataset generator for whole genome shotgun sequencing”, in: Proc. Int.
Conf. Intell. Syst. Mol. Biol., 1999, pp. 202–210.
[120] Lander E.S., et al., “Initial sequencing and analysis of the human genome”, Na-
ture 409 (6822) (2001) 860–921.
[121] Volik S., et al., “End-sequence profiling: Sequence-based analysis of aberrant
genomes”, Proc. Natl. Acad. Sci. USA (2003).
[122] Taillon-Miller P., et al., “Overlapping genomic sequences: a treasure trove of single-
nucleotide polymorphisms”, Genome Res. 8 (7) (1998) 748–754.
[123] Altshuler D., et al., “An SNP map of the human genome generated by reduced repre-
sentation shotgun sequencing”, Nature 407 (6803) (2000) 513–516.
[124] Mullikin J.C., et al., “An SNP map of human chromosome 22”, Nature 407 (6803)
(2000) 516–520.
[125] Dawson E., et al., “A SNP resource for human chromosome 22: extracting dense clus-
ters of SNPs from the genomic sequence”, Genome Res. 11 (1) (2001) 170–178.
[126] Read T.D., et al., “Comparative genome sequencing for discovery of novel polymor-
phisms in Bacillus anthracis”, Science 296 (5575) (2002) 2028–2033.
[127] Lancia G., et al., “SNPs problems, complexity and algorithms”, in: 9th Annual Euro-
pean Symposium on Algorithms (BRICS), University of Aarhus, Denmark, 2001.
[128] Lippert R., et al., “Algorithmic strategies for the single nucleotide polymorphism hap-
lotype assembly problem”, Briefings in Bioinformatics 3 (1) (2002) 23–31.
[129] Fasulo D., et al., “Efficiently detecting polymorphisms during the fragment assembly
process”, Bioinformatics 18 (Suppl.1) (2002) S294–S302.
[130] Casjens S., et al., “A bacterial genome in flux: the twelve linear and nine circular ex-
trachromosomal DNAs in an infectious isolate of the Lyme disease spirochete Borrelia
burgdorferi”, Mol. Microbiol. 35 (3) (2000) 490–516.
[131] Beja O., et al., “Unsuspected diversity among marine aerobic anoxygenic phototrophs”,
Nature 415 (6872) (2002) 630–633.
[132] Randazzo C.L., et al., “Diversity, dynamics, and activity of bacterial communities dur-
ing production of an artisanal Sicilian cheese as evaluated by 16S rRNA analysis”,
Appl. Environ. Microbiol. 68 (4) (2002) 1882–1892.
[133] Pearson H., “Body’s bugs to be sequenced”, in: Nature Science Update, 2003.
[134] Whitfield J., “Genome pioneer sets sights on Sargasso Sea”, in: Nature Science Update,
2003.
[135] Kececioglu J.D., Li M., Tromp J., “Inferring a DNA sequence from erroneous copies”,
Theoretical Computer Science 185 (1) (1997) 3–13.
[136] Liang F., et al., “An optimized protocol for analysis of EST sequences”, Nucleic Acids
Res. 28 (18) (2000) 3657–3665.
[137] Kent W.J., Haussler D., GigAssembler: An Algorithm for the Initial Assembly of the
Human Genome Draft, University of California Santa Cruz, 2000.
Advances in Large Vocabulary Continuous
Speech Recognition
Abstract
The development of robust, accurate and efficient speech recognition systems is
critical to the widespread adoption of a large number of commercial applications.
These include automated customer service, broadcast news transcription and in-
dexing, voice-activated automobile accessories, large-vocabulary voice-activated
cell-phone dialing, and automated directory assistance. This article provides a re-
view of the current state-of-the-art, and the recent research performed in pursuit
of these goals.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
2. Front End Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
2.1. Mel Frequency Cepstral Coefficients . . . . . . . . . . . . . . . . . . . . . . . 252
2.2. Perceptual Linear Predictive Coefficients . . . . . . . . . . . . . . . . . . . . . 254
2.3. Discriminative Feature Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
3. The Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
3.1. Hidden Markov Model Framework . . . . . . . . . . . . . . . . . . . . . . . . 256
3.2. Acoustic Context Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
3.3. Gaussian Mixture State Models . . . . . . . . . . . . . . . . . . . . . . . . . . 259
3.4. Maximum Likelihood Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4. Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.1. Finite State Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.2. N -gram Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5. Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
5.1. The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2. Multipass Lattice Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
1. Introduction
Over the course of the past decade, automatic speech recognition technology has
advanced to the point where a number of commercial applications are now widely
deployed and successful: systems for name-dialing [84,26], travel reservations [11,
72], obtaining weather information [97], accessing financial accounts [16], automated
directory assistance [41], and dictation [86,9,78] are all in current use. The fact that
these systems work for thousands of people on a daily basis is an impressive tes-
timony to technological advance in this area, and it is the aim of this article to de-
scribe the technical underpinnings of these systems and the recent advances that have
made them possible. It must be noted, however, that even though the technology
has matured to the point of commercial usefulness, the problem of large vocabulary
continuous speech recognition (LVCSR) is by no means solved: background noise,
corruption by cell-phone or other transmission channels, unexpected shifts in topic,
foreign accents, and overly casual speech can all cause automated systems to fail.
Thus, where appropriate, we will indicate the shortcomings of current technology,
and suggest areas of future research. Although this article aims for a fairly compre-
hensive coverage of today’s speech recognition systems, a vast amount of work has
been done in this area, and some limitation is necessary. Therefore, this review will
focus primarily on techniques that have proven successful to the point where they
have been widely adopted in competition-grade systems such as [78,36,37,58,27,93].
The cornerstone of all current state-of-the-art speech recognition systems is the
Hidden Markov Model (HMM) [6,43,54,74]. In the context of HMMs, the speech
recognition problem is decomposed as follows. Speech is broken into a sequence
of acoustic observations or frames, each accounting for around 25 milliseconds of
speech; taken together, these frames comprise the acoustics a associated with an
utterance. The goal of the recognizer is to find the likeliest sequence of words w
given the acoustics:
\arg\max_{w} P(w \mid a).
speech. Algorithmically, the steps involved in both methods are approximately the
same, though the motivations and details are different. In both cases, the algorithmic
process is as follows:
(1) compute the power spectrum of the frame,
(2) warp the frequency range of the spectrum so that the high-frequency range is
compressed,
(3) compress the amplitude of the spectrum,
(4) decorrelate the elements of the spectral representation by performing an in-
verse DFT—resulting in a cepstral representation.
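A minimal numpy sketch of this four-step pipeline is given below; it uses a crude mel-style warp and a DCT as the decorrelating inverse transform, and is a generic illustration rather than the exact MFCC or PLP recipe of any particular system (all parameter values are assumptions).

```python
import numpy as np

def frame_to_cepstra(frame, sample_rate=16000, n_bands=24, n_ceps=13):
    """Turn one windowed speech frame into cepstral coefficients.

    Steps: power spectrum -> mel-warped filter-bank energies (compressing the
    high-frequency range) -> log amplitude compression -> DCT to decorrelate,
    yielding a cepstral representation.
    """
    # (1) power spectrum of the frame
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # (2) warp frequencies onto the mel scale and pool into triangular bands
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)
    edges = np.linspace(mel.min(), mel.max(), n_bands + 2)
    energies = np.zeros(n_bands)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        weight = np.clip(np.minimum((mel - lo) / (mid - lo),
                                    (hi - mel) / (hi - mid)), 0.0, None)
        energies[b] = np.sum(weight * power)

    # (3) amplitude compression
    log_energies = np.log(energies + 1e-10)

    # (4) decorrelate with a type-II DCT, keep the first n_ceps coefficients
    n = np.arange(n_bands)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_bands)
    return basis @ log_energies

# A 25 ms frame of synthetic audio at 16 kHz:
frame = np.random.randn(400)
print(frame_to_cepstra(frame).shape)   # (13,)
```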
Empirical studies have shown that recognition performance can be further en-
hanced with the inclusion of features computed not just from a single frame, but
from several surrounding frames as well. One way of doing this is to augment the fea-
ture vectors with the first and second temporal derivatives of the cepstral coefficients
[22]. More recently, however, researchers have applied linear discriminant analysis
[19] and related transforms to project a concatenated sequence of feature vectors into
a low-dimensional space in which phonetic classes are well separated. The following
subsections will address MFCCs, PLP features, and discriminant transforms in detail.
the original signal will be removed through mean-subtraction, due to the logarithmic
nonlinearity. Therefore, mean-subtraction is standard.
more sensitive to spectral peaks, and smooths low-energy regions. In the original
implementation of [38], a fifth-order autoregressive model was used; subsequent im-
plementations use a higher-order model, e.g., order 12 as in [46].
\mu = \frac{1}{N}\sum_i x_i, \qquad \Sigma = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T.
Next, the total within-class variance W is computed:
W = \frac{1}{N}\sum_j N_j \Sigma_j.
Using θ to denote the LDA transformation matrix, the LDA objective function is
given by:
\hat{\theta} = \arg\max_{\theta} \frac{|\theta^T \Sigma \theta|}{|\theta^T W \theta|},
and the optimal transform is given by the top eigenvectors of W^{-1}\Sigma.
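Concretely, the LDA projection can be computed with a few lines of numpy; the sketch below is a generic illustration of the eigendecomposition of W^{-1}Σ, assuming labeled feature vectors are available, and is not the code of any particular recognizer.

```python
import numpy as np

def lda_transform(features, labels, out_dim):
    """Return the LDA projection matrix (out_dim x feature_dim).

    features: (N, d) array of (possibly concatenated) feature vectors
    labels:   (N,) array of phonetic class labels
    The rows are the leading eigenvectors of W^{-1} Sigma, where Sigma is the
    total covariance and W the average within-class covariance.
    """
    mu = features.mean(axis=0)
    centered = features - mu
    sigma = centered.T @ centered / len(features)        # total covariance

    w = np.zeros_like(sigma)
    for c in np.unique(labels):
        cls = features[labels == c]
        cc = cls - cls.mean(axis=0)
        w += cc.T @ cc                                   # N_j * Sigma_j
    w /= len(features)

    # Leading eigenvectors of W^{-1} Sigma give the optimal rows of theta.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(w, sigma))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:out_dim]].T

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 9))
y = rng.integers(0, 3, size=300)
print(lda_transform(x, y, out_dim=2).shape)   # (2, 9)
```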
While LDA finds a projection that tends to maximize relative interclass distances,
it makes two questionable assumptions: first, that the classes are modeled by a full
covariance Gaussian in the transformed space, and second that the covariances of
all transformed classes are identical. The first assumption is problematic because, as
discussed in Section 3.3, full covariance Gaussians are rarely used; but the effect
of violating this assumption can be alleviated by applying a subsequent
transformation meant to minimize the loss in likelihood between the use of full and
diagonal covariance Gaussians [31]. The MLLT transform developed in [31] applies
the transform ψ that minimizes
\sum_j N_j \left( \log\left|\operatorname{diag}\!\left(\psi \Sigma_j \psi^T\right)\right| - \log\left|\psi \Sigma_j \psi^T\right| \right)
and has been empirically found to be quite effective in conjunction with LDA [77].
To address the assumption of equal covariances, [77] proposes the maximization
of
\prod_j \left( \frac{|\theta \Sigma \theta^T|}{|\theta \Sigma_j \theta^T|} \right)^{N_j}
and presents favorable results when used in combination with MLLT. A closely re-
lated technique, HLDA [50], relates projective discriminant analysis to maximum
likelihood training, where the unused dimensions are modeled with a shared covari-
ance. This form of analysis may be used both with and without the constraint that
the classes be modeled by a diagonal covariance model in the projected space, and
has also been widely adopted. Combined, LDA and MLLT provide on the order of a
10% relative reduction in word-error rate [77] over simple temporal derivatives.
FIG. 3. A simple HMM representing the state sequence of three words. Adding an arc from the final
state back to the start state would allow repetition.
Note that in the HMM framework, each acoustic vector is associated with a spe-
cific state in the HMM. Thus, a sequence of n acoustic vectors will correspond to
a sequence of n consecutive states. We will denote a specific sequence of states
s1 = a, s2 = b, s3 = c, . . . , sn = k by s. In addition to normal emitting states, it
is often convenient to use “null” states, which do not emit acoustic observations.
In particular, we will assume that the HMM starts at time t = 0 in a special null
start-state α, and that all paths must end in a special null final-state ω at t = N + 1.
In general, having a specific word hypothesis w will be compatible with only some
state sequences, s, and not with others. It is necessary, therefore, to constrain sums
over state sequences to those sequences that are compatible with a given word se-
quence; we will not, however, introduce special notation to make this explicit. With
this background, the overall probability is factored as follows:
P(a \mid w) = \sum_{s} P(a \mid s)\,P(s \mid w) = \sum_{s} \prod_{t=1}^{n} b_{s_t}(o_t)\, a_{s_t s_{t-1}}.
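In practice the sum over state sequences is evaluated with the forward algorithm; the sketch below is a generic illustration over a small left-to-right HMM, with the null start state handled by a separate transition row and the null end state assumed reachable from every emitting state (both simplifications), not the code of any deployed system.

```python
import numpy as np

def forward_likelihood(trans, emit_probs):
    """Compute P(a|w) by summing over all state sequences (forward algorithm).

    trans:      (S+1, S) array; row 0 holds transitions out of the null start
                state, rows 1..S hold transitions between the S emitting states.
    emit_probs: (n, S) array with emit_probs[t, s] = b_s(o_t), the probability
                of the t-th acoustic observation under emitting state s.
    """
    n, S = emit_probs.shape
    alpha = trans[0] * emit_probs[0]            # leave the null start state at t=1
    for t in range(1, n):
        alpha = (alpha @ trans[1:]) * emit_probs[t]
    return alpha.sum()                          # every state may reach the null end state

trans = np.array([[1.0, 0.0],       # start -> state 0
                  [0.6, 0.4],       # state 0: self-loop / advance
                  [0.0, 1.0]])      # state 1: self-loop
emit = np.array([[0.9, 0.1],
                 [0.7, 0.2],
                 [0.1, 0.8]])
print(forward_likelihood(trans, emit))
```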
Figure 3 illustrates a simple HMM that represents the state sequences of three
words.
The following sections describe the components of the HMM acoustic model in
more detail. Section 3.2 will focus on the mapping from words to states that is nec-
essary to determine P (s|w). Section 3.3 discusses the Gaussian mixture models that
are typically used to model bj (o). The transition probabilities can be represented
in a simple table, and no further discussion is warranted. The section closes with a
description of the training algorithms used for parameter estimation.
Example pronunciation dictionary entries (note the two alternative pronunciations of "tomato"):
barge | B AA R JH
tomato | T AH M EY T OW
tomato | T AH M AA T OW
(1) Word-internal triphones. A phone and its immediate neighbors to the left and
right. However, special units are used at the beginnings and endings of words
so that context does not persist across word boundaries.
(2) Cross-word triphones. The same as above, except that context persists across
word boundaries, resulting in better coarticulation modeling.
(3) Cross-word quinphones. A phone and its two neighbors to the left and right.
(4) A phone, and all the other phones in the same word.
(5) A phone, all the other phones in the same word, and all phones in the preced-
ing word.
1. Create a record for each frame that includes the frame and the phonetic context
associated with it.
2. Model the frames associated with a node with a single diagonal-covariance
Gaussian. The frames associated with a node will have a likelihood according to
this model.
3. For each yes/no question based on the context window, compute the likelihood that
would result from partitioning the examples according to the induced split.
4. Split the frames in the node using the question that results in the greatest likelihood
gain, and recursively process the resulting two nodes.
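A rough Python sketch of the splitting step in the four numbered steps above might look like this (illustrative only; the question set, the context representation, and the variance flooring are assumptions, not the chapter's implementation).

```python
import numpy as np

def diag_gauss_loglik(frames):
    """Log likelihood of frames under a single diagonal-covariance Gaussian
    with maximum-likelihood mean and variance."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6            # floor to avoid zero variance
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mu) ** 2 / var)

def best_split(frames, contexts, questions):
    """Choose the yes/no context question with the largest likelihood gain.

    frames:    (N, d) acoustic vectors assigned to the node.
    contexts:  length-N list of phonetic-context records.
    questions: list of (name, predicate) pairs; predicate(context) -> bool.
    """
    base = diag_gauss_loglik(frames)
    best_name, best_gain = None, 0.0
    for name, q in questions:
        mask = np.array([q(c) for c in contexts])
        if mask.all() or (~mask).all():        # split must be non-trivial
            continue
        gain = (diag_gauss_loglik(frames[mask])
                + diag_gauss_loglik(frames[~mask]) - base)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain
```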
One example of this is EMLLT [70], in which the inverse covariance matrix of each
Gaussian j is modeled as the sum of basis matrices. First, a set of d dimensional
basis vectors al is defined. Then inverse covariances are modeled as:
$$\Sigma_j^{-1} = \sum_{l=1}^{D} \lambda_l^{j}\, a_l a_l^T.$$
This discussion has avoided a number of subtleties that arise in practice, but are
not central to the ideas. Specifically, when multiple observation streams are avail-
able, an extra summation must be added outside all others in the reestimation for-
mulae. Also, observation probabilities are tied across multiple states—the same “ae”
acoustic model may be used in multiple HMM states. This entails creating summary
statistics for each acoustic model by summing the statistics of all the states that use it.
Finally, in HMMs with extensive null states, the recursions and reestimation formu-
lae must be modified to reflect the spontaneous propagation of probabilities through
chains of null states.
Using the training data D to approximate the sum over all words and acoustics, we
can represent the mutual information as
$$\sum_{D} \log \frac{P_\theta(a, w)}{P_\theta(w)P_\theta(a)} = \sum_{D} \log \frac{P_\theta(w)P_\theta(a|w)}{P_\theta(w)P_\theta(a)} = \sum_{D} \log \frac{P_\theta(a|w)}{P_\theta(a)} = \sum_{D} \log \frac{P_\theta(a|w)}{\sum_{w'} P_\theta(w')P_\theta(a|w')}.$$
If we assume that the language model determining Pθ (w) is constant (as is the case
in acoustic model training) then this is identical to optimizing the posterior word
probability:
$$\sum_{D} \log P_\theta(w|a) = \sum_{D} \log \frac{P_\theta(a|w)P_\theta(w)}{\sum_{w'} P_\theta(w')P_\theta(a|w')}.$$
Before describing MMI training in detail, we note that the procedure that will
emerge is not much different from training an ML system. Procedurally, one first
computes the state-occupancy probabilities and first and second order statistics ex-
actly as for a ML system. This involves summing path posteriors over all HMM
paths that are consistent with the known word hypotheses. One then repeats exactly
the same process, but sums over all HMM paths without regard to the transcripts.
The two sets of statistics are then combined in a simple update procedure. For histor-
ical reasons, the first set of statistics is referred to as “numerator” statistics and the
second (unconstrained) set as “denominator” statistics.
An effective method for performing MMI optimization was first developed in [30]
for the case of discrete hidden Markov models. The procedure of [30] works in gen-
eral to improve objective functions R(θ ) that are expressible as
$$R(\theta) = \frac{s_1(\theta)}{s_2(\theta)}$$
with $s_1$ and $s_2$ being polynomials with $s_2 > 0$. Further, for each individual probability distribution $\lambda$ under adjustment, it must be the case that $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$. In this case, it is proved that the parameter update

$$\hat{\lambda}_i = \frac{\lambda_i \left( \frac{\partial \log R(\lambda)}{\partial \lambda_i} + D \right)}{\sum_k \lambda_k \left( \frac{\partial \log R(\lambda)}{\partial \lambda_k} + D \right)}$$
is guaranteed to increase the objective function, with a large enough value of the
constant D. In the case of discrete variables, it is shown that
$$\frac{\partial \log R(\lambda)}{\partial \lambda_i} = \frac{1}{\lambda_i}\left( C^{\mathrm{num}}_{\lambda_i} - C^{\mathrm{den}}_{\lambda_i} \right)$$

where $\lambda_i$ is the probability of the event associated with $\lambda_i$ being true, and $C_{\lambda_i}$ is the count of the number of times this event occurred, as computed from the α–β recursions of the previous section.
Later work [68,92] extended these updates to Gaussian means and variances, and [92] did extensive work to determine appropriate values of D for large vocabulary speech recognition. For state j and mixture component m, let $S(x)$ denote the first order statistics, $S(x^2)$ the second order statistics, and C the occupancy counts. The mean and variance updates are then

$$\hat{\mu}_{jm} = \frac{S^{\mathrm{num}}_{jm}(x) - S^{\mathrm{den}}_{jm}(x) + D\,\mu_{jm}}{C^{\mathrm{num}}_{jm} - C^{\mathrm{den}}_{jm} + D},$$

$$\hat{\sigma}^2_{jm} = \frac{S^{\mathrm{num}}_{jm}(x^2) - S^{\mathrm{den}}_{jm}(x^2) + D\left(\sigma^2_{jm} + \mu^2_{jm}\right)}{C^{\mathrm{num}}_{jm} - C^{\mathrm{den}}_{jm} + D} - \hat{\mu}^2_{jm}.$$
For the mixture weights, let $f_{jm}$ be the mixture coefficient associated with mixture component m of state j. Then

$$\hat{f}_{jm} = \frac{f_{jm}\left(\frac{\partial \log R(\lambda)}{\partial f_{jm}} + D\right)}{\sum_k f_{jk}\left(\frac{\partial \log R(\lambda)}{\partial f_{jk}} + D\right)}$$

with

$$\frac{\partial \log R(\lambda)}{\partial f_{jk}} = \frac{1}{f_{jk}}\left(C^{\mathrm{num}}_{jk} - C^{\mathrm{den}}_{jk}\right).$$
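For concreteness, here is a minimal Python sketch (not from the chapter; the array shapes and names are my own) of the resulting extended Baum–Welch update for a single diagonal Gaussian, given numerator and denominator occupancy counts and first- and second-order statistics.

```python
import numpy as np

def ebw_update(c_num, c_den, sx_num, sx_den, sxx_num, sxx_den, mu, var, D):
    """Extended Baum-Welch (MMI) update for one diagonal Gaussian.

    c_*:   numerator / denominator occupancy counts (scalars).
    sx_*:  first-order statistics, sum of gamma_t * x_t     (d-vectors).
    sxx_*: second-order statistics, sum of gamma_t * x_t**2 (d-vectors).
    mu, var: current mean and diagonal variance.
    D:     smoothing constant (must be large enough to keep variances positive).
    """
    denom = c_num - c_den + D
    mu_new = (sx_num - sx_den + D * mu) / denom
    var_new = (sxx_num - sxx_den + D * (var + mu ** 2)) / denom - mu_new ** 2
    return mu_new, var_new
```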
Several alternative ways for reestimating the mixture weights are given in [92].
MMI has been found to give a 5–10% relative improvement in large vocabulary
tasks [92], though the advantage diminishes as systems with larger numbers of Gaus-
sians are used [58]. The main disadvantage of MMI training is that the denominator
statistics must be computed over all possible paths. This requires either doing a full
decoding of the training data at each iteration, or the computation of lattices (see
Section 5.2). Both options are computationally expensive unless an efficiently writ-
ten decoder is available.
4. Language Model
At a slightly higher level, Backus-Naur Form [64] is often used for more elaborate grammars with replacement patterns.
In fact, BNF is able to represent context free grammars [13]—a broad class of gram-
mars in which recursive rule definitions allow the recognition of some strings that
cannot be represented with regular expressions. However, in comparison with regu-
lar expressions, context-free grammars have had relatively little effect on ASR, and
will not be discussed further.
Many of the tools and conventions associated with regular expressions were devel-
oped in the context of computer language compilers, in which texts (programs) were
either syntactically correct or not. In this context, there is no need for a notion of
how correct a string is, or alternatively what the probability of it being generated by
a speaker of the language is. Recall, however, that in the context of ASR, we are in-
terested in P (w), the probability of a word sequence. This can easily be incorporated
into the regular expression framework, simply by assigning costs or probabilities to
the rules in the grammar.
Grammars are frequently used in practical dialog applications, where develop-
ers have the freedom to design system prompts and then specify a grammar that is
expected to handle all reasonable replies. For example, in an airline-reservation ap-
plication the system might ask “Where do you want to fly to?” and then activate
a grammar designed to recognize city names. Due to their simplicity and intuitive
nature, these sorts of grammars are the first choice wherever possible.
ence, and pragmatic relevance—in practice, researchers have been unable to signifi-
cantly improve on it.
A typical large vocabulary system will recognize between 30 and 60 thousand
words, and use a 3 or 4-gram language model trained on around 200 million words
[78]. While 200 million words seems at first to be quite large, in fact for a 3-gram LM
with a 30,000 word vocabulary, it is actually quite small compared to the $27 \times 10^{12}$
distinct trigrams that need to be represented. In order to deal with this problem of data
sparsity, a great deal of effort has been spent on developing techniques for reliably
estimating the probabilities of rare events.
4.2.1 Smoothing
Smoothing is perhaps the most important practical detail in building N -gram
language models, and these techniques fall broadly into three categories: additive
smoothing, backoff models, and interpolated models. The following sections touch
briefly on each, giving a full description for only interpolated LMs, which have been
empirically found to give good performance on a variety of tasks. The interested
reader can find a full review of all these methods in [12].
$$P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^{i}\right)}{c\left(w_{i-n+1}^{i-1}\right)}.$$
The problem, of course, is that for high-order N -gram models, many of the pos-
sible (and perfectly normal) word sequences in a language will not be seen, and
thus assigned zero-probability. This is extraordinarily harmful to a speech recogni-
tion system, as one that uses such a model will never be able to decode these novel
word sequences. One of the simplest ways of dealing with such a problem is to use
a set of fictitious or imaginary counts to encode our prior knowledge that all word
sequences have some likelihood. In the most basic implementation [42], one simply
adds a constant amount δ to each possible event. For a vocabulary of size |V |, one
then has:
$$P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) = \frac{\delta + c\left(w_{i-n+1}^{i}\right)}{\delta|V| + c\left(w_{i-n+1}^{i-1}\right)}.$$
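A toy Python sketch of this additive smoothing follows (illustrative only; a real language model would be trained on far more data and stored more compactly).

```python
from collections import Counter

def add_delta_lm(tokens, n=2, delta=0.5):
    """Build an additively smoothed n-gram model from a token list."""
    vocab = set(tokens)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    history = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))

    def prob(word, hist):
        # P(w | h) = (delta + c(h, w)) / (delta * |V| + c(h))
        return (delta + ngrams[hist + (word,)]) / (delta * len(vocab) + history[hist])

    return prob

p = add_delta_lm("the cat sat on the mat the cat ran".split())
print(p("cat", ("the",)), p("dog", ("the",)))   # the unseen "dog" still gets mass
```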
and

$$N_{k+}\left(w_{i-n+1}^{i-1}\,\bullet\right) = \left|\left\{ w_i : c\left(w_{i-n+1}^{i-1} w_i\right) \geq k \right\}\right|.$$
The modified Kneser-Ney estimate is then given as

$$P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^{i}\right) - D\left(c\left(w_{i-n+1}^{i}\right)\right)}{c\left(w_{i-n+1}^{i-1}\right)} + \gamma\left(w_{i-n+1}^{i-1}\right) P\left(w_i \,\middle|\, w_{i-n+2}^{i-1}\right).$$
Defining

$$Y = \frac{n_1}{n_1 + 2 n_2},$$

where $n_r$ is the number of n-grams that occur exactly r times, the discounting factors are given by

$$D(c) = \begin{cases} 0 & \text{if } c = 0,\\[2pt] 1 - 2Y\dfrac{n_2}{n_1} & \text{if } c = 1,\\[2pt] 2 - 3Y\dfrac{n_3}{n_2} & \text{if } c = 2,\\[2pt] 3 - 4Y\dfrac{n_4}{n_3} & \text{if } c \geq 3. \end{cases}$$
The backoff weights are determined by

$$\gamma\left(w_{i-n+1}^{i-1}\right) = \frac{D_1 N_1\left(w_{i-n+1}^{i-1}\,\bullet\right) + D_2 N_2\left(w_{i-n+1}^{i-1}\,\bullet\right) + D_{3+} N_{3+}\left(w_{i-n+1}^{i-1}\,\bullet\right)}{c\left(w_{i-n+1}^{i-1}\right)}.$$
This model has been found to slightly outperform most other models and is in use
in state-of-the-art systems [78]. Because D(0) = 0, this can also be expressed in a
backoff form.
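The discount computation can be sketched as follows (illustrative Python, not the chapter's code; it assumes a count table large enough that $n_1$ through $n_4$ are all nonzero).

```python
from collections import Counter

def kneser_ney_discounts(ngram_counts):
    """Compute the modified Kneser-Ney discounts D(1), D(2), D(3+) from the
    counts-of-counts n_1..n_4 of an n-gram count table (dict: ngram -> count)."""
    freq_of_freq = Counter(ngram_counts.values())
    n1, n2, n3, n4 = (freq_of_freq[r] for r in (1, 2, 3, 4))
    Y = n1 / (n1 + 2 * n2)
    D1 = 1 - 2 * Y * n2 / n1
    D2 = 2 - 3 * Y * n3 / n2
    D3p = 3 - 4 * Y * n4 / n3

    def D(c):
        # D(0) = 0, so unseen n-grams fall through to the backoff term
        return 0.0 if c == 0 else (D1 if c == 1 else (D2 if c == 2 else D3p))

    return D
```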
probabilities: $P_k(w_i | w_{i-n+1}^{i-1})$. These models are then combined with weighting factors $\lambda_k$:

$$P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) = \sum_k \lambda_k P_k\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right), \qquad \sum_k \lambda_k = 1.$$
For example, in a recent conversational telephony system [78] an interpolation of
data gathered from the web, broadcast news data, and two sources of conversational
data (with weighting factors 0.4, 0.2, 0.2, and 0.2 respectively) resulted in about
a 10% relative improvement over using the largest single source of conversational
training data.
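Such an interpolation is straightforward to implement; a minimal sketch follows (illustrative; the component models and weights are placeholders, not the system described above).

```python
def interpolate(models, weights):
    """Combine component LMs: P(w|h) = sum_k lambda_k * P_k(w|h), weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def prob(word, hist):
        return sum(lam * m(word, hist) for lam, m in zip(weights, models))
    return prob

# e.g. four component models combined with weights 0.4, 0.2, 0.2, 0.2
# combined = interpolate([web_lm, bn_lm, conv_lm1, conv_lm2], [0.4, 0.2, 0.2, 0.2])
```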
best viewed in terms of a sequence of finite state transductions. In this model, one
begins with a finite state encoding of the language model, but represents the expan-
sion at each level—from word to pronunciation, pronunciation to phone, and phone
to state—as the composition of the previous representation with a finite state trans-
ducer. The potential advantage of this approach is a consistent representation of each
form of expansion, with the actual operations being performed by a single compo-
sition function. In practice, care must be taken to ensure that the composition oper-
ations do not use large amounts of memory, and in some cases, it is inconvenient to
express the acoustic context model in the form of a transducer (e.g., when long span
context models are used).
In some ways, the most important advantage of finite-state representations is that
operations of determinization and minimization were recently developed by [59,60].
Classical algorithms were developed in the 1970s [1] for unweighted graphs as found
in compilers, but the extension to weighted graphs (the weights being the language
model and transition probabilities) has made these techniques relevant to speech
recognition. While it is beyond the scope of this paper to present the algorithms
for determinization and minimization, we briefly describe the properties.
A graph is said to be deterministic if each outgoing arc from a given state has a
unique label. In the context of speech recognition graphs, the arcs are labeled with
either HMM states, or word, pronunciation, or phone labels. While the process of
taking a graph and finding an equivalent deterministic one is well defined, the de-
terministic representation can in pathological cases grow exponentially in the num-
ber of states of the input graph. In practice, this rarely happens, but the graph does
grow. The benefit actually derives from the specific procedures used to implement
the Viterbi search described in Section 5.1. Suppose one has identified a fixed num-
ber w of states that are reasonably likely at a given time t. Only a small number k
of HMM states are likely to have good acoustic matches, and thus to lead to likely
states at time t + 1. Thus, if on average z outgoing arcs per state are labeled with a
given HMM state, the number of likely states at t + 1 will be on the order of zkw.
By using a deterministic graph, z is limited to 1, which tends to decrease the number of states that will ever be deemed likely. In practice, this property can lead to an
order-of-magnitude speedup in search time, and makes determinization critical.
One can also ask, given a deterministic graph, what is the smallest equivalent
deterministic graph. The process of minimization [59] produces such a graph, and in
practice often reduces graph sizes by a factor of two or three.
4.2.4 Pruning
Modern corpus collections [33] often contain an extremely large amount of data—
between 100 million and a billion words. Given that N -gram language models can
backoff to lower-order statistics when high-order statistics are unavailable, and that
representing extremely large language models can be disadvantageous from the
point-of-view of speed and efficiency, it is natural to ask how one can trade off lan-
guage model size and fidelity. Probably the simplest way of doing this is to impose
a count threshold, and then to use a lower-order backoff estimate for the probability
of the nth word in such N -grams.
A somewhat more sophisticated approach [80] looks at the loss in likelihood
caused by using the backoff estimate to select N -grams to prune. Using P and P ′
to denote the original and backed-off estimates, and N(·) to represent the (possibly
discounted) number of times an N -gram occurs, the loss in log likelihood caused by
the omission of an N-gram $w_{i-n+1}^{i}$ is given by:

$$N\left(w_{i-n+1}^{i}\right)\left[\log P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) - \log P'\left(w_i \,\middle|\, w_{i-n+2}^{i-1}\right)\right].$$
In the “Weighted Difference Method” [80], one computes all these differences, and
removes the N -grams whose difference falls below a threshold. A related approach
[82] uses the Kullback–Leibler distance between the original and pruned language
models to decide which N -grams to prune. The contribution of an N -gram in the
original model to this KL distance is given by:
$$P\left(w_{i-n+1}^{i}\right)\left[\log P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) - \log P'\left(w_i \,\middle|\, w_{i-n+2}^{i-1}\right)\right]$$
and the total KL distance is found by summing over all N -grams in the original
model. The algorithm of [82] works in batch mode, first computing the change in
relative entropy that would result from removing each N -gram, and then removing
all those below a threshold, and recomputing backoff weights. A comparison of the
weighted-difference and relative-entropy approaches shows that the two criteria are
the same in form, and the difference between the two approaches is primarily in the
recomputation of backoff weights that is done in [82]. In practice, LM pruning can
be extremely useful in limiting the size of a language model in compute-intensive
tasks.
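A sketch of the weighted-difference criterion follows (illustrative only; the probability functions, count table, and threshold are placeholders for whatever LM implementation is in use).

```python
import math

def prune_ngrams(ngrams, counts, prob, backoff_prob, threshold):
    """Weighted Difference Method: drop n-grams whose loss in training-data
    log likelihood, N(.) * [log P(w|h) - log P'(w|h')], falls below a threshold.

    ngrams:       iterable of n-grams, each a tuple h + (w,).
    counts:       dict mapping n-gram -> (possibly discounted) count N(.).
    prob:         function(w, h)  -> probability from the full model.
    backoff_prob: function(w, h') -> probability from the lower-order model.
    """
    kept = []
    for ng in ngrams:
        h, w = ng[:-1], ng[-1]
        loss = counts[ng] * (math.log(prob(w, h)) - math.log(backoff_prob(w, h[1:])))
        if loss >= threshold:
            kept.append(ng)
    return kept
```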
able to combine syntactic and semantic information. For example, [8] presents a class
composed of feet miles pounds degrees inches barrels tons acres meters bytes and
many similar classes whose members are similar both syntactically and semantically.
Later work [66] extends the class-based model to the case where a word may
map into multiple classes, and a general mapping function S(·) is used to map a
word history $w_{i-n+1}^{i-1}$ into a specific equivalence class s. Under these more general assumptions, we have

$$P\left(w_i \,\middle|\, w_{i-n+1}^{i-1}\right) = \sum_{c_i} P(w_i|c_i) \sum_{s} P(c_i|s)\, P\left(s \,\middle|\, w_{i-n+1}^{i-1}\right).$$
5. Search
Recall that the objective of a decoder is to find the best word sequence w∗ given
the acoustics:
$$w^* = \arg\max_w P(w|a) = \arg\max_w \frac{P(w)\,P(a|w)}{P(a)}.$$
The crux of this problem is that with a vocabulary size V and utterance length N, the number of possible word-sequences is $O(V^N)$, i.e., it grows exponentially in the
utterance length. Over the years, the process of finding this word sequence has been
one of the most studied aspects of speech recognition with numerous techniques and
variations developed, [29,69,2]. Interestingly, in recent years, there has been a renais-
sance of interest in the simplest of these decoding algorithms: the Viterbi procedure.
The development of better HMM compilation techniques along with faster comput-
ers has made Viterbi applicable to both large vocabulary recognition and constrained
tasks, and therefore this section will focus on Viterbi alone.
Remarkably, due to the limited-history property of HMMs, this can be done with an
extremely simple algorithm [54,73]. We define
(1) δt (j ): the cost of the best path ending in state j at time t,
(2) Ψt (j ): the state preceding state j on the best path ending in state j at time t,
(3) pred(s): the set of states that are s's immediate predecessors in the HMM graph.
These quantities can then be computed for all states and all times according to the
recursions
(1) Initialize
• $\delta_0(\alpha) = 1$,
• $\Psi_0(s) = \text{undefined} \;\; \forall s$,
• $\delta_0(s) = 0 \;\; \forall s \neq \alpha$;
(2) Recursion
• $\delta_t(s) = \max_{j \in \mathrm{pred}(s)} \delta_{t-1}(j)\, A_{js}\, b_s(o_t)$,
• $\Psi_t(s) = \arg\max_{j \in \mathrm{pred}(s)} \delta_{t-1}(j)\, A_{js}\, b_s(o_t)$.
Thus, to perform decoding, one computes the δs and their backpointers Ψ , and
then follows the backpointers backwards from the final state ω at time N + 1. This
produces the best path, from which the arc labels can be read off.
In practice, there are several issues that must be addressed. The simplest of these
is that the products of probabilities that define the δs will quickly underflow arith-
metic precision. This can be easily dealt with by representing numbers with their log-
arithms instead. A more difficult issue occurs when non-emitting states are present
throughout the graph. The semantics of null states in this case are that spontaneous
transitions are allowed without consuming any acoustic frames. The update for a
given time frame must then proceed in two stages:
(1) The δs for emitting states are computed in any order by looking at their pre-
decessors.
(2) The δs for null states are computed by iterating over them in topological order
and looking at their predecessors.
The final practical issue is that in large systems, it may be advantageous to use prun-
ing to limit the number of states that are examined at each time frame. In this case,
one can maintain a fixed number of “live” states at each time frame. The decoding
must then be modified to “push” the δs of the live states at time t to the successor
states at time t + 1.
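Putting the recursions, log-domain arithmetic, and backtrace together, a minimal Viterbi decoder for an HMM with only emitting states might be sketched as follows (illustrative; null-state propagation and pruning are omitted, and the array layout is an assumption).

```python
import numpy as np

def viterbi(logA, logB, log_start):
    """Viterbi decoding in the log domain (avoids arithmetic underflow).

    logA:      (S, S) log transition probabilities.
    logB:      (T, S) log emission probabilities, logB[t, j] = log b_j(o_t).
    log_start: (S,)   log probability of starting in each state.
    Returns the best state sequence and its log score.
    """
    T, S = logB.shape
    delta = log_start + logB[0]
    psi = np.zeros((T, S), dtype=int)           # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[j, s] = delta(j) + log A[j, s]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    # follow the backpointers backwards from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())
```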
An examination of the Viterbi recursions reveals that for an HMM with A arcs and
an utterance of N frames, the runtime is O(NA) and the space required is O(NS).
However, it is interesting to note that through the use of a divide-and-conquer recur-
sion, the space used can be reduced to $O(Sk \log_k N)$ at the expense of a runtime of
$O(NA \log_k N)$ [98]. This is often useful for processing long conversations, messages
or broadcasts. The Viterbi algorithm can be applied to any HMM, and the primary
distinction is whether the HMM is explicitly represented and stored in advance, or
whether it is constructed “on-the-fly.” The following two sections address these ap-
proaches.
FIG. 7. A word lattice. Any path from the leftmost start state to the rightmost final state represents a possible word sequence.
leading into a state must be labeled with the same word sequence). We note also that
that the posterior probability of a word occurrence in a lattice can be computed as
the ratio of the sum likelihood of all the paths through the lattice that use the lattice
link, to the sum likelihood of all paths entirely. These quantities can be computed
with recursions analogous to the HMM αβ recursions, e.g., as in [98].
Once generated, lattices can be used in a variety of ways. Generally, these involve
recomputing the acoustic and language model scores in the lattice with more sophis-
ticated models, and then finding the best path with respect to these updated scores.
Some specific examples are:
• Lattices are generated with an acoustic model in which there is no cross-word
acoustic context, and then rescored with a model using cross-word acoustic
context, e.g., [58,46].
• Lattices are generated with a speaker-independent system, and then rescored
using speaker-adapted acoustic models, e.g., [93].
• Lattices are generated with a bigram LM and then rescored with a trigram or
4-gram LM, e.g., [93,55].
The main potential advantage of using lattices is that the rescoring operations can be
faster than decoding from scratch with sophisticated models. With efficient Viterbi
implementations on static decoding graphs, however, it is not clear that this is the
case [78].
$$w^* = \arg\max_w P(w|a) = \arg\max_w \frac{P(w)\,P(a|w)}{P(a)}.$$
Unfortunately, this is not identical to minimizing the WER metric by which speech
recognizers are scored. The MAP hypothesis will asymptotically minimize sentence
error rate, but not necessarily word error rate. Recent work [81,57] has proposed
that the correct objective function is really the expected word-error rate under the
posterior probability distribution. Denoting the reference or true word sequence by r
and the string edit distance between w and r by E(w, r), the expected error is:
$$E_{P(r|a)}\left[E(w, r)\right] = \sum_r P(r|a)\, E(w, r).$$
FIG. 8. A word lattice.
There is no known dynamic programming procedure for finding this optimum when
the potential word sequences are represented with a general lattice. Therefore, [57]
proposes instead to work with a segmental or sausage-like structure as illustrated in
Fig. 8. To obtain this structure, the links in a lattice are clustered so that temporally
overlapping and phonetically similar word occurrences are grouped together. Often,
multiple occurrences of the same word (differing in time-alignment or linguistic his-
tory) end up together in the same bin, where their posterior probabilities are added
together. Under the assumption of a sausage structure, the expected error can then
be minimized simply by selecting the link with highest posterior probability in each
bin [57]. This procedure has been widely adopted and generally provides a 5 to 10%
relative improvement in large vocabulary recognition performance.
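The final selection step is simple; a minimal sketch follows (illustrative; posteriors are assumed to be already summed per bin, with an empty string standing in for a deletion).

```python
def consensus_decode(bins):
    """Pick, in each confusion-network bin, the word with the highest posterior.

    bins: list of dicts mapping word -> summed posterior probability
          (an empty-word entry "" can represent a deletion).
    """
    return [max(b, key=b.get) for b in bins]

sausage = [{"i": 0.9, "a": 0.1},
           {"veal": 0.3, "feel": 0.55, "": 0.15},
           {"fine": 0.8, "find": 0.2}]
print(consensus_decode(sausage))   # ['i', 'feel', 'fine']
```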
used. The combination of 3 to 5 systems may produce on the order of 10% relative
improvement over the best single system.
6. Adaptation
The goal of speaker adaptation is to modify the acoustic and language models
in light of the data obtained from a specific speaker, so that the models are more
closely tuned to the individual. This field has increased in importance since the early
1990s, has been intensively studied, and is still the focus of a significant amount of
research. However, since no consensus has emerged on the use of language model
adaptation, and many state-of-the-art systems do not use it, this section will focus
solely on acoustic model adaptation. In this area, there are three main techniques:
• Maximum A Posteriori (MAP) adaptation, which is the simplest form of
acoustic adaptation;
• Vocal Tract Length Normalization (VTLN), which warps the frequency scale to
compensate for vocal tract differences;
• Maximum Likelihood Linear Regression, which adjusts the Gaussians and/or
feature vectors so as to increase the data likelihood according to an initial tran-
scription.
These methods will be discussed in the following sections.
The principled use of MAP estimation has been thoroughly investigated in [28],
which presents the formulation that appears here.
The most convenient representation of the prior parameters for p-dimensional
Gaussian mixture models is given by Dirichlet priors for the mixture weights
w1 , . . . , wK , and normal-Wishart densities for the Gaussians (parameterized by
means mi and inverse covariance matrices ri ). These priors are expressed in terms of
the following parameters:
• νk ; a count νk > 0,
• τk ; a count τk > 0,
• αk ; a count αk > p − 1,
• µk ; a p dimensional vector,
• uk ; a p × p positive definite matrix.
Other necessary notation is:
• ckt : the posterior probability of Gaussian k at time t,
• K: the number of Gaussians,
• n: the number of frames.
With this notation, the MAP estimates of the Gaussian mixture parameters are:

$$w_k' = \frac{(\nu_k - 1) + \sum_{t=1}^{n} c_{kt}}{(n - K) + \sum_{k=1}^{K} \nu_k}, \qquad m_k' = \frac{\tau_k \mu_k + \sum_{t=1}^{n} c_{kt}\, x_t}{\tau_k + \sum_{t=1}^{n} c_{kt}},$$

$$r_k'^{-1} = \frac{u_k + \tau_k (\mu_k - m_k')(\mu_k - m_k')^T}{\alpha_k - p + \sum_{t=1}^{n} c_{kt}} + \frac{\sum_{t=1}^{n} c_{kt}\,(x_t - m_k')(x_t - m_k')^T}{\alpha_k - p + \sum_{t=1}^{n} c_{kt}}.$$
Unfortunately, there are a large number of free parameters in the representation of
the prior, making this formulation somewhat cumbersome in practice. [28] discusses
setting these, but in practice it is often easier to work in terms of fictitious counts.
Recall that in EM, the Gaussian parameters are estimated from first and second-order
sufficient statistics accumulated over the data. One way of obtaining reasonable pri-
ors is simply to compute these over the entire training set without regard to phonetic
state, and then to weight them according to the amount of emphasis that is desired
for the prior. Similarly, statistics computed for one corpus can be downweighted and
added to the statistics from another.
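In the fictitious-count view, the MAP estimate of a Gaussian mean reduces to a count-weighted interpolation of the prior mean and the adaptation-data statistics; a minimal sketch (variable names are my own, not the chapter's):

```python
import numpy as np

def map_mean(prior_mu, tau, gamma, sx):
    """MAP estimate of a Gaussian mean via fictitious counts.

    prior_mu: prior (e.g., speaker-independent) mean.
    tau:      fictitious count controlling the weight of the prior.
    gamma:    occupancy count of the Gaussian in the adaptation data.
    sx:       first-order statistic sum_t gamma_t * x_t from the adaptation data.
    """
    return (tau * prior_mu + sx) / (tau + gamma)

# with little data the estimate stays near the prior; with much data it
# approaches the maximum likelihood estimate sx / gamma
print(map_mean(np.array([0.0, 0.0]), tau=10.0, gamma=2.0, sx=np.array([3.0, 1.0])))
```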
basic idea is to warp the frequency scale so that the acoustic vectors of a speaker are
made more similar to a canonical speaker-independent model. (This idea of “canon-
icalizing” the feature vectors will recur in another form in Section 6.3.2.) Figure 9
illustrates the form of one common warping function.
There are a very large number of variations on VTLN, and for illustration we
choose the implementation presented in [87]. In this procedure, the FFT vector as-
sociated with each frame is warped according to a warping function like that in Fig. 9.
Ten possible warping scales are considered, ranging in the slope of the initial seg-
ment from 0.88 to 1.2. The key to this technique is to build a simple model of voiced
speech, consisting of a single mixture of Gaussians trained on frames that are iden-
tified as being voiced. (This identification is made on the basis of a cepstral analysis
described in [40].) To train the voicing model, each speaker is assigned an initial
warp scale of 1, and then the following iterative procedure is used:
(1) Using the current warp scales for each speaker, train a GMM for the voiced
frames.
(2) Assign to each speaker the warp scale that maximizes the likelihood of his or
her warped features according to the current voicing model.
(3) Go to 1.
After several iterations, the outcome of this procedure is a voicing scale for each
speaker, and a voicing model. Histograms of the voicing scales are generally bi-
modal, with one peak for men, and one for women. Training of the standard HMM
parameters can then proceed as usual, using the warped or canonicalized features.
The decoding process is similar. For the data associated with a single speaker, the
following procedure is used:
(1) Select the warp scale that maximizes the likelihood of the warped features
according to the voicing model.
(2) Warp the features and decode as usual.
The results reported in [87] indicate a 12% relative improvement in performance
over unnormalized models, and improvements of this scale are typical [89,96].
As mentioned, a large number of VTLN variants have been explored. [37,61,89]
choose warp scales by maximizing the data likelihood with respect to a full-blown
HMM model, rather than a single GMM for voiced frames, and experiment with
the size of this model. The precise nature of the warping has also been subject to
scrutiny; [37] uses a piecewise linear warp with two discontinuities rather than one;
[61] experiments with a power law warping function of the form
$$f' = f_N \left( \frac{f}{f_N} \right)^{\beta}$$
where fN is the bandwidth and [96] experiments with bilinear warping functions of
the form
$$f' = f + 2\arctan\left(\frac{(1 - \alpha)\sin(f)}{1 - (1 - \alpha)\cos(f)}\right).$$
Generally, the findings are that piecewise linear models work as well as the more
complex models, and that simple acoustic models can be used to estimate the warp
factors.
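A piecewise-linear warp of the kind discussed above can be sketched as follows (illustrative only; the break point, the 8 kHz Nyquist frequency, and the convention of pinning the band edge are assumptions, not the published recipes).

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyquist=8000.0, f_break=0.875):
    """Warp frequencies with a piecewise-linear function: slope alpha up to a
    break frequency, then a second segment chosen so that f_nyquist maps to
    itself (a single discontinuity in slope).

    freqs: array of frequencies (e.g., FFT bin centers) in Hz.
    alpha: warp factor, typically in roughly [0.88, 1.2].
    """
    f0 = f_break * f_nyquist
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_nyquist - alpha * f0) / (f_nyquist - f0) * (freqs - f0),
    )

bins = np.linspace(0, 8000, 9)
print(piecewise_linear_warp(bins, alpha=0.92))
```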
The techniques described so far operate by finding a warp scale using the princi-
ples of maximum likelihood estimation. An interesting alternative presented in [20,
32] is based on normalizing formant positions. In [20], a warping function of the
form
$$f' = k_s^{\,3f/8000}$$
is used, where ks is the ratio of the speaker’s third formant to the average frequency
of the third formant. In [32], the speaker’s first, second, and third formants are plotted
against their average values, and the slope of the line fitting these points is used as the
warping scale. These approaches, while nicely motivated, have the drawback that it
is not easy to identify formant positions, and they have not been extensively adopted.
6.3 MLLR
A seminal paper [52] sparked intensive interest in the mid 1990s in techniques for
adapting the means and/or variances of the Gaussians in an HMM model. Whereas
VTLN can be thought of as a method for standardizing acoustics across speakers,
and

$$\hat{\Sigma} = H \Sigma H^T$$
where W and H are the matrices to be estimated. A procedure for doing this is
presented in [23].
7. Performance Levels
In order to illustrate the error rates attainable with today’s technology—and the
relative contribution of the techniques discussed in earlier sections—the following
paragraphs describe the state-of-the-art as embodied by an advanced IBM system
in 2002 [46]. This system was designed to work well across a wide variety of speak-
ers and topics, and is tested on five separate datasets:
(1) Telephone conversations (Swb98).
(2) Meeting recordings (mtg).
(3) Two sets of call center recordings of customers discussing account informa-
tion (cc1 and cc2).
(4) Voicemail recordings (vm).
In this system, the recognition steps are as follows:
P1 Speaker-independent decoding. The system uses mean-normalized MFCC fea-
tures and an acoustic model with 4078 left context-dependent states and 171K
mixture components.
P2 VTLN decoding. VTLN warp factors are estimated for each speaker using
forced alignments of the data to the recognition hypotheses from P1, then
recognition is performed with a VTLN system that uses mean-normalized
PLP features and an acoustic model with 4440 left context-dependent states
and 163K mixture components.
P3 Lattice generation. Initial word lattices are generated with a SAT system that
uses mean-normalized PLP features and an acoustic model with 3688 word-
internal context-dependent states and 151K mixture components. FMLLR
transforms are computed with the recognition hypotheses from P2.
P4 Acoustic rescoring with large SAT models. The lattices from P3 are rescored
with five different SAT acoustic models and pruned. The acoustic models are
as follows:
A An MMI trained PLP system with 10437 left context-dependent states and
623K mixture components. The maximum value of c0 is subtracted from
each feature vector, and mean-normalization is performed for the other cep-
stral coefficients.
B An MLE PLP system identical to the system of P4A, except for the use of
MLE training of the acoustic model.
C An MLE PLP system with 10450 left context-dependent states and 589K
mixture components. This system uses mean normalization of all raw fea-
tures including c0.
D A SPAM MFCC system with 10133 left context-dependent states and 217K
mixture components.
E An MLE MFCC system with 10441 left context-dependent states and 600K
mixture components. This system uses max.-normalization of c0 and mean
normalization of all other raw features.
The FMLLR transforms for each of the five acoustic models are computed
from the one-best hypotheses in the lattices from P3.
P5 Acoustic model adaptation. Each of the five acoustic models are adapted
with MLLR using one-best hypotheses from their respective lattices generated
in P4.
P6 4-gram rescoring. Each of the five sets of lattices from P5 are rescored and
pruned using a 4-gram language model.
P7 Confusion network combination. Each of the five sets of lattices from P6 are
processed to generate confusion networks [57], then a final recognition hy-
pothesis is generated by combining the confusion networks for each utterance.
The performance of the various recognition passes on the test set is summarized in
Table I.
TABLE I. Word error rates (%) for each test set at each processing stage, and the overall average error rate. For passes where multiple systems are used (P4–P6), the best error rate for a test component is highlighted.
8. Conclusion
Over the past decade, incremental advances in HMM technology have advanced
the state of the art to the point where commercial use is possible. These advances
have occurred in all areas of speech recognition, and include
• LDA and HLDA analysis in feature extraction,
• discriminative training,
• VTLN, MLLR and FMLLR for speaker adaptation,
• the use of determinization and minimization in decoding graph compilation,
• consensus decoding,
• voting and system combination.
Collectively applied, these advances produce impressive results for many speakers
under many conditions. However, under some conditions, such as when background
noise is present or speech is transmitted over a low-quality cell phone or a speaker has
an unusual accent, today’s systems can fail. As the error-rates of Section 7 illustrate,
this happens often enough that the average error rate for numerous tasks across a variety of conditions is around 30%, far from human levels. Thus, the most critical problem over the coming decade is to develop truly robust techniques that reduce the error rate by another factor of five.
R EFERENCES
[1] Aho A.V., Sethi R., Ullman J.D., Compilers: Principles, Techniques, and Tools,
Addison–Wesley, Reading, MA, 1986.
[2] Aubert X., “A brief overview of decoding techniques for large vocabulary continuous
speech recognition”, in: Automatic Speech Recognition: Challenges for the New Mil-
lennium, 2000.
[3] Axelrod S., Gopinath R., Olsen P., “Modeling with a subspace constraint on inverse
covariance matrices”, in: ICSLP, 2002.
[4] Bahl L.R., Brown P.F., de Souza P.V., Mercer R.L., “Maximum mutual information
estimation of hidden Markov model parameters for speech recognition”, in: ICASSP,
1986, pp. 49–52.
[5] Bahl L.R., et al., “Context dependent modeling of phones in continuous speech using
decision trees”, in: Proceedings of DARPA Speech and Natural Language Processing
Workshop, 1991.
[6] Baker J., “The Dragon system—an overview”, IEEE Transactions on Acoustics,
Speech, and Signal Processing 23 (1975) 24–29.
[7] Bamberg P., “Vocal tract normalization”, Technical report, Verbex, 1981.
[8] Brown P.F., et al., “Class-based n-gram models of natural language”, Comput. Lin-
guist. 18 (1992).
[9] Chen S., Eide E., Gales M., Gopinath R., Olsen P., “Recent improvements in IBM’s
speech recognition system for automatic transcription of broadcast speech”, in: Pro-
ceedings of the DARPA Broadcast News Workshop, 1999.
[10] Chen S.S., Gopalakrishnan P.S., “Clustering via the Bayesian information criterion with
applications in speech recognition”, in: ICASSP, 1995, pp. 645–648.
[11] Chen S., et al., “Speech recognition for DARPA communicator”, in: ICASSP, 2001.
[12] Chen S.F., Goodman J., “An empirical study of smoothing techniques for language
modeling”, Technical Report TR-10-98, Harvard University, 1998.
[13] Chomsky N., Aspects of the Theory of Syntax, MIT Press, Cambridge, MA, 1965.
[14] CMU, The CMU Pronouncing Dictionary, 2003.
[15] Linguistic Data Consortium, Callhome American English lexicon (pronlex), 2003.
[16] Davies K., et al., “The IBM conversational telephony system for financial applications”,
in: Eurospeech, 1999.
[17] Davis S., Mermelstein P., “Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences”, IEEE Transactions on Acoustics,
Speech, and Signal Processing 28 (1980) 357–366.
[18] Digalakis V.V., Rtischev D., Neumeyer L.G., “Speaker adaptation using constrained
estimation of Gaussian mixtures”, IEEE Transactions on Speech and Audio Processing
(1995) 357–366.
[19] Duda R.O., Hart P.E., Pattern Classification and Scene Analysis, Wiley, New York,
1973.
[20] Eide E., Gish H., “A parametric approach to vocal tract length normalization”, in:
ICASSP, 1996, pp. 346–348.
[21] Fiscus J.G., “A post-processing system to yield reduced word error rates: Recognizer
output voting error reduction (rover)”, in: IEEE Workshop on Automatic Speech Recog-
nition and Understanding, 1997.
[22] Furui S., “Speaker independent isolated word recognition using dynamic features of
speech spectrum”, IEEE Transactions on Acoustics Speech and Signal Processing 34
(1986) 52–59.
[23] Gales M.J.F., “Maximum likelihood linear transformations for HMM-based speech
recognition”, Technical Report CUED-TR-291, Cambridge University, 1997.
[24] Gales M.J.F., “Maximum likelihood linear transformations for HMM-based speech
recognition”, Computer Speech and Language 12 (1998).
[25] Gales M.J.F., Woodland P.C., “Mean and variance adaptation within the MLLR frame-
work”, Computer Speech and Language 10 (1996) 249–264.
[26] Gao Y., Ramabhadran B., Chen J., Erdogan H., Picheny M., “Innovative approaches for
large vocabulary name recognition”, in: ICASSP, 2001.
[27] Gauvain J.-L., Lamel L., Adda G., “The LIMSI 1999 BN transcription system”, in:
Proceedings 2000 Speech Transcription Workshop, 2000, http://www.nist.gov/speech/
publications/tw00/html/abstract.htm.
[28] Gauvain J.-L., Lee C.-H., “Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains”, IEEE Transactions on Speech and Audio
Processing 2 (1994) 291–298.
[29] Gopalakrishnan P.S., Bahl L.R., Mercer R.L., “A tree-search strategy for large vocabu-
lary continuous speech recognition”, in: ICASSP, 1995.
[30] Gopalakrishnan P., Kanevsky D., Nadas A., Nahamoo D., “An inequality for rational
functions with applications to some statistical estimation problems”, IEEE Transactions
on Information Theory 37 (1991) 107–113.
[31] Gopinath R., “Maximum likelihood modeling with Gaussian distributions for classifi-
cation”, in: ICASSP, 1998.
[32] Gouvea E.B., Stern R.M., “Speaker normalization through formant-based warping of
the frequency scale”, in: Eurospeech, 1997.
[33] Graff D., The English Gigaword Corpus, 2003.
[34] Gusfield D., Algorithms on Strings, Trees and Sequences, Cambridge Univ. Press, Cam-
bridge, UK, 1997.
[35] Haeb-Umbach R., Ney H., “Linear discriminant analysis for improved large vocabulary
continuous speech recognition”, in: ICASSP, 1992.
[36] Hain T., Woodland P.C., Evermann G., Povey D., “The CU-HTK March 2000 HUB5E
transcription system”, in: Proc. Speech Transcription Workshop, 2000.
[37] Hain T., Woodland P.C., Niesler T.R., Whittaker E.W.D., “The 1998 HTK system for
transcription of conversational telephone speech”, in: Eurospeech, 1999.
[38] Hermansky H., “Perceptual linear predictive (PLP) analysis of speech”, J. Acoustical
Society of America 87 (1990) 1738–1752.
[39] Hopcroft J.E., Ullman J.D., Introduction to Automata Theory, Languages and Compu-
tation, Addison–Wesley, Reading, MA, 1979.
[40] Hunt M.J., “A robust method of detecting the presence of voiced speech”, in: ICASSP,
1995.
[41] Jan E., Maison B., Mangu L., Zweig G., “Automatic construction of unique signa-
tures and confusable sets for natural language directory assistance applications”, in:
Eurospeech, 2003.
[42] Jeffreys H., Theory of Probability, Clarendon, Oxford, 1948.
[43] Jelinek F., “Continuous speech recognition by statistical methods”, Proceedings of the
IEEE 64 (1976) 532–556.
[44] Kamm T., Andreou A., Cohen J., “Vocal tract normalization in speech recognition:
Compensating for systematic speaker variability”, in: Proceedings of the 15th Annual
Speech Recognition Symposium, Baltimore, MD, 1995, pp. 175–178.
[45] Katz S.M., “Estimation of probabilities from sparse data for the language model com-
ponent of a speech recognizer”, IEEE Transactions of Acoustics, Speech and Signal
Processing 35 (1987) 400–401.
[46] Kingsbury B., Mangu L., Saon G., Zweig G., Axelrod S., Visweswariah K., Picheny M.,
“Towards domain independent conversational speech recognition”, in: Eurospeech,
2003.
[47] Kneser R., Ney H., “Improved backing-off for m-gram language modeling”, in:
ICASSP, 1995.
[48] Kuhn R., “Speech recognition and the frequency of recently used words: A modified
Markov model for natural language”, in: 12th International Conference on Computa-
tional Linguistics, Budapest, 1988, pp. 348–350.
[49] Kuhn R., De Mori R., “A cache based natural language model for speech recognition”,
IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 570–583.
[50] Kumar N., Andreou A.G., “Heteroscedastic discriminant analysis and reduced rank
HMMs for improved speech recognition”, Speech Communication (1998) 283–297.
[51] Leggetter C., Woodland P.C., “Flexible speaker adaptation using maximum likelihood
linear regression”, in: Eurospeech, 1995.
[52] Leggetter C., Woodland P.C., “Speaker adaptation of continuous density HMMs using
multivariate linear regression”, in: ICSLP, 1994.
[53] Leggetter C.J., Woodland P.C., “Flexible speaker adaptation using maximum likelihood
linear regression”, in: Eurospeech, 1995.
[54] Levinson S.E., Rabiner L.R., Sondhi M.M., “An introduction to the application of the
theory of probabilistic functions of a Markov process to automatic speech recognition”,
The Bell System Technical Journal 62 (1983) 1035–1074.
[55] Ljolje A., et al., “The AT&T 2000 LVSCR system”, in: Proceedings 2000 Speech
Transcription Workshop, 2000, http://www.nist.gov/speech/publications/tw00/html/
abstract.htm.
[56] Makhoul J., “Linear prediction: A tutorial review”, Proceedings of the IEEE 63 (1975)
561–580.
[57] Mangu L., Brill E., Stolcke A., “Finding consensus in speech recognition: Word error
minimization and other applications of confusion networks”, Computer Speech and
Language 14 (2000) 373–400.
[58] Matsoukas S., et al., “Speech to text research at BBN”, in: Proceedings of January 2003
EARS Midyear Meeting, 2003.
[59] Mohri M., “Finite-state transducers in language and speech processing”, Comput. Lin-
guist. 23 (1997).
[60] Mohri M., Riley M., Hindle D., Ljolje A., Pereira F., “Full expansion of context-
dependent networks in large vocabulary speech recognition”, in: ICASSP, 1998.
[61] Molau S., Kanthak S., Ney H., “Efficient vocal tract normalization in automatic speech
recognition”, in: ESSV, 2000, pp. 209–216.
[62] Nadas A., “A decision theoretic formulation of a training problem in speech recognition
and a comparison of training by conditional versus unconditional maximum likeli-
hood”, IEEE Transactions on Acoustics, Speech, and Signal Processing 31 (1983).
[63] Nadas A., Nahamoo D., Picheny M., “On a model-robust training method for speech
recognition”, IEEE Transactions on Acoustics, Speech, and Signal Processing 36
(1988).
[64] Naur P., “Revised report on the algorithmic language Algol 60”, Communications of
the Association for Computing Machinery 6 (1963) 1–17.
[65] Neukirchen C., Klakow D., Aubert X., “Generation and expansion of word graphs using
long span context information”, in: ICASSP, 2001.
[66] Ney H., Essen U., Kneser R., “On structuring probabilistic dependences in stochastic
language modelling”, Computer Speech and Language (1994) 1–38.
[67] Niesler T.R., Whittaker E.W.D., Woodland P.C., “Comparison of part-of-speech and
automatically derived category-based language models for speech recognition”, in:
ICASSP, 1998.
[68] Normandin Y., Regis C., De Mori R., “High-performance connected digit recogni-
tion using maximum mutual information”, IEEE Transactions on Speech and Audio
Processing 2 (1994) 299–311.
[69] Odell J.J., “The use of context in large vocabulary speech recognition”, Cambridge
University dissertation, 1995.
[70] Olsen P., Gopinath R., “Extended MLLT for Gaussian mixture models”, IEEE Trans-
actions on Speech and Audio Processing (2001).
[71] Ortmanns S., Ney H., “A word graph algorithm for large vocabulary continuous speech
recognition”, Computer Speech and Language (1997) 43–72.
[72] Pellom B., Ward W., Hansen J., Hacioglu K., Zhang J., Yu X., Pradhan S., “University
of Colorado dialog systems for travel and navigation”, in: Human Language Technolo-
gies, 2001.
[73] Rabiner L.R., Juang B.-H., “An introduction to hidden Markov models”, IEEE ASSP
Magazine (1986) 4–16.
[74] Rabiner L.R., Juang B.-H., Fundamentals of Speech Recognition, Prentice Hall, New
York, 1993.
[75] Rosenfeld R., “A maximum entropy approach to adaptive statistical language model-
ing”, Computer Speech and Language 10 (1996) 187–228.
[76] Sankar A., Gadde V.R.R., Stolcke A., Weng F., “Improved modeling and efficiency for
automatic transcription of broadcast news”, Speech Communication 37 (2002) 133–
158.
[77] Saon G., Padmanabhan M., Gopinath R., Chen S., “Maximum likelihood discriminant
feature spaces”, in: ICASSP, 2000.
[78] Saon G., Zweig G., Kingsbury B., Mangu L., Chaudhari U., “An architecture for rapid
decoding of large vocabulary conversational speech”, in: Eurospeech, 2003.
[79] Schroeder M.R., “Recognition of complex acoustic signals”, in: Bullock T.H. (Ed.),
Life Sciences Research Report 5, Abakon Verlag, 1977.
[80] Seymore K., Rosenfeld R., “Scalable backoff language models”, in: ICSLP, 1996.
[81] Stolcke A., Konig Y., Weintraub M., “Explicit word error minimization using n-best
list rescoring”, in: Eurospeech, 1997.
[82] Stolcke A., “Entropy-based pruning of backoff language models”, in: Proceedings of
DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp. 270–
274.
[83] Stolcke A., “Srilm—an extensible language modeling toolkit”, in: ICSLP, 2002.
[84] Suontausta J., Hakkinen J., Olli V., “Fast decoding in large vocabulary name dialing”,
in: ICASSP, 2000, pp. 1535–1538.
[85] Wakita H., “Normalization of vowels by vocal-tract length and its application to vowel
identification”, IEEE Transactions on Audio Speech and Signal Processing (1977) 183–
192.
[86] Wegmann S., Zhan P., Carp I., Newman M., Yamron J., Gillick L., “Dragon systems’
1998 broadcast news transcription system”, in: Proceedings of the DARPA Broadcast
News Workshop, NIST, 1999.
[87] Wegmann S., McAllaster D., Orloff J., Peskin B., “Speaker normalization on conversa-
tional telephone speech”, in: ICASSP, 1996.
[88] Welling L., Haberland N., Ney H., “Acoustic front-end optimization for large vocabu-
lary speech recognition”, in: Eurospeech, 1997.
[89] Welling R., Haeb-Umbach R., Aubert X., Haberland N., “A study on speaker normaliza-
tion using vocal tract normalization and speaker adaptive training”, in: ICASSP, 1998,
pp. 797–800.
[90] Weng F., Stolcke A., Sankar A., “Efficient lattice representation and generation”, in:
ICSLP, 1998.
[91] Whittaker E.W.D., Woodland P.C., “Efficient class-based language modelling for very
large vocabularies”, in: ICASSP, 2001.
[92] Woodland P.C., Povey D., “Large scale discriminative training for speech recognition”,
in: Automatic Speech Recognition: Challenges for the New Millennium, 2000.
[93] Woodland P., et al., “The CU-HTK April 2002 switchboard system”, in: EARS Rich
Transcription Workshop, 2002.
[94] Young S., Odell J., Ollason D., Valtchev V., Woodland P., The HTK Book, 2.1 edition,
Entropic Cambridge Research Laboratory, 1997.
[95] Young S.J., Odell J.J., Woodland P.C., “Tree-based tying for high accuracy acoustic
modelling”, in: ARPA Workshop on Human Language Technology, 1994.
[96] Zhan P., Waibel A., “Vocal tract length normalization for large vocabulary continuous
speech recognition”, Technical Report CMU-CS-97-148, School of Computer Science,
Carnegie Mellon University, 1997.
[97] Zue V., et al., “A telephone-based conversational interface for weather information”,
2000.
[98] Zweig G., Padmanabhan M., “Exact alpha–beta computation in logarithmic space with
application to map word graph construction”, in: ICSLP, 2000.
[99] Zweig G., Saon G., Yvon F., “Arc minimization in finite state decoding graphs with
cross-word acoustic context”, in: ICSLP, 2002.
[100] Zwicker E., “Subdivision of the audible frequency range into critical bands”, J. Acousti-
cal Society of America 33 (1961) 248.
[101] Zwicker E., “Masking and physiological excitation as consequences of ear’s frequency
analysis”, in: Plomp R., Smoorenburg G.F. (Eds.), Frequency Analysis and Periodicity
Detection in Hearing, 1970.
Volume 40
Program Understanding: Models and Experiments
A. VON M AYRHAUSER AND A. M. VANS
Software Prototyping
A LAN M. D AVIS
Rapid Prototyping of Microelectronic Systems
A POSTOLOS D OLLAS AND J. D. S TERLING BABCOCK
Cache Coherence in Multiprocessors: A Survey
M AZIN S. Y OUSIF, M. J. T HAZHUTHAVEETIL , AND C. R. D AS
The Adequacy of Office Models
C HANDRA S. A MARAVADI , J OEY F. G EORGE , O LIVIA R. L IU S HENG , AND JAY F. N UNAMAKER
Volume 41
Directions in Software Process Research
H. D IETER ROMBACH AND M ARTIN V ERLAGE
The Experience Factory and Its Relationship to Other Quality Approaches
V ICTOR R. BASILI
CASE Adoption: A Process, Not an Event
J OCK A. R ADER
On the Necessary Conditions for the Composition of Integrated Software Engineering Environments
D AVID J. C ARNEY AND A LAN W. B ROWN
Software Quality, Software Process, and Software Testing
D ICK H AMLET
Advances in Benchmarking Techniques: New Standards and Quantitative Metrics
T HOMAS C ONTE AND W EN - MEI W. H WU
An Evolutionary Path for Transaction Processing Systems
C ARLTON P U , AVRAHAM L EFF , AND S HU -W EI F. C HEN
Volume 42
Nonfunctional Requirements of Real-Time Systems
T EREZA G. K IRNER AND A LAN M. D AVIS
A Review of Software Inspections
A DAM P ORTER , H ARVEY S IY, AND L AWRENCE V OTTA
Advances in Software Reliability Engineering
J OHN D. M USA AND W ILLA E HRLICH
Network Interconnection and Protocol Conversion
M ING T. L IU
A Universal Model of Legged Locomotion Gaits
S. T. V ENKATARAMAN
315
316 CONTENTS OF VOLUMES IN THIS SERIES
Volume 43
Program Slicing
D AVID W. B INKLEY AND K EITH B RIAN G ALLAGHER
Language Features for the Interconnection of Software Components
R ENATE M OTSCHNIG -P ITRIK AND ROLAND T. M ITTERMEIR
Using Model Checking to Analyze Requirements and Designs
J OANNE ATLEE , M ARSHA C HECHIK , AND J OHN G ANNON
Information Technology and Productivity: A Review of the Literature
E RIK B RYNJOLFSSON AND S HINKYU YANG
The Complexity of Problems
W ILLIAM G ASARCH
3-D Computer Vision Using Structured Light: Design, Calibration, and Implementation Issues
F RED W. D E P IERO AND M OHAN M. T RIVEDI
Volume 44
Managing the Risks in Information Systems and Technology (IT)
ROBERT N. C HARETTE
Software Cost Estimation: A Review of Models, Process and Practice
F IONA WALKERDEN AND ROSS J EFFERY
Experimentation in Software Engineering
S HARI L AWRENCE P FLEEGER
Parallel Computer Construction Outside the United States
R ALPH D UNCAN
Control of Information Distribution and Access
R ALF H AUSER
Asynchronous Transfer Mode: An Engineering Network Standard for High Speed Communications
RONALD J. V ETTER
Communication Complexity
E YAL K USHILEVITZ
Volume 45
Control in Multi-threaded Information Systems
PABLO A. S TRAUB AND C ARLOS A. H URTADO
Parallelization of DOALL and DOACROSS Loops—a Survey
A. R. H URSON , J OFORD T. L IM , K RISHNA M. K AVI , AND B EN L EE
Programming Irregular Applications: Runtime Support, Compilation and Tools
J OEL S ALTZ , G AGAN A GRAWAL , C HIALIN C HANG , R AJA D AS , G UY E DJLALI , PAUL
H AVLAK , Y UAN -S HIN H WANG , B ONGKI M OON , R AVI P ONNUSAMY, S HAMIK S HARMA ,
A LAN S USSMAN , AND M USTAFA U YSAL
Optimization Via Evolutionary Processes
S RILATA R AMAN AND L. M. PATNAIK
Software Reliability and Readiness Assessment Based on the Non-homogeneous Poisson Process
A MRIT L. G OEL AND K UNE -Z ANG YANG
Computer-supported Cooperative Work and Groupware
J ONATHAN G RUDIN AND S TEVEN E. P OLTROCK
Technology and Schools
G LEN L. B ULL
Volume 46
Software Process Appraisal and Improvement: Models and Standards
MARK C. PAULK
A Software Process Engineering Framework
JYRKI KONTIO
Gaining Business Value from IT Investments
PAMELA SIMMONS
Reliability Measurement, Analysis, and Improvement for Large Software Systems
JEFF TIAN
Role-based Access Control
RAVI SANDHU
Multithreaded Systems
KRISHNA M. KAVI, BEN LEE, AND ALI R. HURSON
Coordination Models and Languages
GEORGE A. PAPADOPOULOS AND FARHAD ARBAB
Multidisciplinary Problem Solving Environments for Computational Science
ELIAS N. HOUSTIS, JOHN R. RICE, AND NAREN RAMAKRISHNAN
Volume 47
Natural Language Processing: A Human-Computer Interaction Perspective
BILL MANARIS
Cognitive Adaptive Computer Help (COACH): A Case Study
EDWIN J. SELKER
Cellular Automata Models of Self-replicating Systems
JAMES A. REGGIA, HUI-HSIEN CHOU, AND JASON D. LOHN
Ultrasound Visualization
THOMAS R. NELSON
Patterns and System Development
BRANDON GOLDFEDDER
High Performance Digital Video Servers: Storage and Retrieval of Compressed Scalable Video
SEUNGYUP PAEK AND SHIH-FU CHANG
Software Acquisition: The Custom/Package and Insource/Outsource Dimensions
PAUL NELSON, ABRAHAM SEIDMANN, AND WILLIAM RICHMOND
Volume 48
Architectures and Patterns for Developing High-performance, Real-time ORB Endsystems
DOUGLAS C. SCHMIDT, DAVID L. LEVINE, AND CHRIS CLEELAND
Heterogeneous Data Access in a Mobile Environment – Issues and Solutions
J. B. LIM AND A. R. HURSON
The World Wide Web
HAL BERGHEL AND DOUGLAS BLANK
Progress in Internet Security
RANDALL J. ATKINSON AND J. ERIC KLINKER
Digital Libraries: Social Issues and Technological Advances
HSINCHUN CHEN AND ANDREA L. HOUSTON
Architectures for Mobile Robot Control
JULIO K. ROSENBLATT AND JAMES A. HENDLER
Volume 49
A Survey of Current Paradigms in Machine Translation
BONNIE J. DORR, PAMELA W. JORDAN, AND JOHN W. BENOIT
Formality in Specification and Modeling: Developments in Software Engineering Practice
J. S. FITZGERALD
3-D Visualization of Software Structure
MATHEW L. STAPLES AND JAMES M. BIEMAN
Using Domain Models for System Testing
A. VON MAYRHAUSER AND R. MRAZ
Exception-handling Design Patterns
WILLIAM G. BAIL
Managing Control Asynchrony on SIMD Machines—a Survey
NAEL B. ABU-GHAZALEH AND PHILIP A. WILSEY
A Taxonomy of Distributed Real-time Control Systems
J. R. ACRE, L. P. CLARE, AND S. SASTRY
Volume 50
Index Part I
Subject Index, Volumes 1–49
Volume 51
Index Part II
Author Index
Cumulative list of Titles
Table of Contents, Volumes 1–49
Volume 52
Eras of Business Computing
ALAN R. HEVNER AND DONALD J. BERNDT
Numerical Weather Prediction
FERDINAND BAER
Machine Translation
SERGEI NIRENBURG AND YORICK WILKS
The Games Computers (and People) Play
JONATHAN SCHAEFFER
From Single Word to Natural Dialogue
NIELS OLE BERNSEN AND LAILA DYBKJAER
Embedded Microprocessors: Evolution, Trends and Challenges
MANFRED SCHLETT
Volume 53
Shared-Memory Multiprocessing: Current State and Future Directions
PER STENSTRÖM, ERIK HAGERSTEN, DAVID J. LILJA, MARGARET MARTONOSI, AND MADAN VENUGOPAL
Volume 54
An Overview of Components and Component-Based Development
ALAN W. BROWN
Working with UML: A Software Design Process Based on Inspections for the Unified Modeling Language
GUILHERME H. TRAVASSOS, FORREST SHULL, AND JEFFREY CARVER
Enterprise JavaBeans and Microsoft Transaction Server: Frameworks for Distributed Enterprise Components
AVRAHAM LEFF, JOHN PROKOPEK, JAMES T. RAYFIELD, AND IGNACIO SILVA-LEPE
Maintenance Process and Product Evaluation Using Reliability, Risk, and Test Metrics
NORMAN F. SCHNEIDEWIND
Computer Technology Changes and Purchasing Strategies
GERALD V. POST
Secure Outsourcing of Scientific Computations
MIKHAIL J. ATALLAH, K. N. PANTAZOPOULOS, JOHN R. RICE, AND EUGENE SPAFFORD
Volume 55
The Virtual University: A State of the Art
LINDA HARASIM
The Net, the Web and the Children
W. NEVILLE HOLMES
Source Selection and Ranking in the WebSemantics Architecture Using Quality of Data Metadata
GEORGE A. MIHAILA, LOUIQA RASCHID, AND MARIA-ESTER VIDAL
Mining Scientific Data
NAREN RAMAKRISHNAN AND ANANTH Y. GRAMA
History and Contributions of Theoretical Computer Science
JOHN E. SAVAGE, ALAN L. SALEM, AND CARL SMITH
Security Policies
ROSS ANDERSON, FRANK STAJANO, AND JONG-HYEON LEE
Transistors and IC Design
YUAN TAUR
Volume 56
Software Evolution and the Staged Model of the Software Lifecycle
KEITH H. BENNETT, VACLAV T. RAJLICH, AND NORMAN WILDE
Embedded Software
EDWARD A. LEE
Empirical Studies of Quality Models in Object-Oriented Systems
LIONEL C. BRIAND AND JÜRGEN WÜST
Software Fault Prevention by Language Choice: Why C Is Not My Favorite Language
RICHARD J. FATEMAN
Quantum Computing and Communication
PAUL E. BLACK, D. RICHARD KUHN, AND CARL J. WILLIAMS
Exception Handling
PETER A. BUHR, ASHIF HARJI, AND W. Y. RUSSELL MOK
Breaking the Robustness Barrier: Recent Progress on the Design of the Robust Multimodal System
SHARON OVIATT
Using Data Mining to Discover the Preferences of Computer Criminals
DONALD E. BROWN AND LOUISE F. GUNDERSON
Volume 57
On the Nature and Importance of Archiving in the Digital Age
HELEN R. TIBBO
Preserving Digital Records and the Life Cycle of Information
SU-SHING CHEN
Managing Historical XML Data
SUDARSHAN S. CHAWATHE
Adding Compression to Next-Generation Text Retrieval Systems
NIVIO ZIVIANI AND EDLENO SILVA DE MOURA
Are Scripting Languages Any Good? A Validation of Perl, Python, Rexx, and Tcl against C, C++, and Java
LUTZ PRECHELT
Issues and Approaches for Developing Learner-Centered Technology
CHRIS QUINTANA, JOSEPH KRAJCIK, AND ELLIOT SOLOWAY
Personalizing Interactions with Information Systems
SAVERIO PERUGINI AND NAREN RAMAKRISHNAN
Volume 58
Software Development Productivity
KATRINA D. MAXWELL
Transformation-Oriented Programming: A Development Methodology for High Assurance Software
VICTOR L. WINTER, STEVE ROACH, AND GREG WICKSTROM
Bounded Model Checking
ARMIN BIERE, ALESSANDRO CIMATTI, EDMUND M. CLARKE, OFER STRICHMAN, AND YUNSHAN ZHU
Advances in GUI Testing
ATIF M. MEMON
Software Inspections
MARC ROPER, ALASTAIR DUNSMORE, AND MURRAY WOOD
Software Fault Tolerance Forestalls Crashes: To Err Is Human; To Forgive Is Fault Tolerant
LAWRENCE BERNSTEIN
Advances in the Provisions of System and Software Security—Thirty Years of Progress
RAYFORD B. VAUGHN
Volume 59
Collaborative Development Environments
GRADY BOOCH AND ALAN W. BROWN
Tool Support for Experience-Based Software Development Methodologies
SCOTT HENNINGER
Why New Software Processes Are Not Adopted
STAN RIFKIN
Impact Analysis in Software Evolution
MIKAEL LINDVALL
Coherence Protocols for Bus-Based and Scalable Multiprocessors, Internet, and Wireless Distributed Computing Environments: A Survey
JOHN SUSTERSIC AND ALI HURSON