
A Multi-FPGA 10x-Real-Time High-Speed Search Engine for a 5000-Word Vocabulary Speech Recognizer


Edward C. Lin
Electrical & Computer Engineering, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh PA 15213
eclin@ece.cmu.edu

Rob A. Rutenbar
Electrical & Computer Engineering, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh PA 15213
rutenbar@ece.cmu.edu

ABSTRACT
Today's best quality speech recognition systems are implemented in software. These systems fully occupy the resources of a high-end server to deliver results at real-time speed: each hour of audio requires a significant fraction of an hour of computation for recognition. This is profoundly limiting for applications that require extreme recognition speed, for example, high-volume tasks such as video indexing (e.g., YouTube), or high-speed tasks such as triage of homeland security intelligence. We describe the architecture and implementation of one critical component – the backend search stage – of a high-speed, large-vocabulary recognizer. Implemented on a multi-FPGA Berkeley Emulation Engine 2 (BEE2) platform, we handle a standard 5000-word Wall Street Journal speech benchmark. Our backend search engine decodes on average 10 times faster than real-time with negligible degradation in accuracy, while running at 100 MHz, a clock rate roughly 30x slower than a conventional server. To the best of our knowledge, this is both the most complex and the fastest recognizer ever to be realized in hardware form.

Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Signal processing

General Terms
Algorithms, Performance, Design

Keywords
Speech Recognition, FPGA, DSP, In Silico Vox

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FPGA'09, February 22–24, 2009, Monterey, California, USA.
Copyright 2009 ACM 978-1-60558-410-2/09/02...$5.00.

1. INTRODUCTION
Speech recognition tools translate human speech data into searchable text. Whether running on a desktop PC or an enterprise server farm, all of today's state-of-the-art recognizers exist as complex software running on conventional computers [1]. This is profoundly limiting for applications that require extreme recognition speed. Today's most sophisticated recognizers fully occupy the computational resources of a high-end server to deliver results at real-time speed: each hour of audio requires a significant fraction of an hour of computation for recognition. We need much faster recognizers to: triage vast streams of audio intercepts for threats to national security; extract searchable text from the exponentially growing torrent of audio being uploaded to the web; mine business intelligence from recorded customer-agent interactions in call centers; and provide real-time transcription of medical data. Existing commercial solutions for these so-called audio mining [2] problems fall far short: they cannot be simultaneously fast (20-50x faster than real-time), accurate (~5% errors), and cost-effective.

Our solution is to move today's best-quality speech recognition strategies directly into silicon. This is the same path taken by compute-intensive tasks such as graphics, which have seen performance improvements of six orders of magnitude over the last decade.

There have been several prior efforts in the area of hardware accelerators for speech recognition. However, none of these target the unique problems of large-scale audio mining. Perhaps the best-known early effort was a full-custom, large-vocabulary NMOS-based recognizer done at Berkeley in 1991 [3]. Two decades of progress on fundamental recognition algorithms have eclipsed most of the strategies used in this early hardware. Other, more recent custom hardware speech recognition designs have been developed, but they rely on the dynamic time warp (DTW) algorithm [4][5], which lacks the ability to scale to large-vocabulary tasks without unacceptable degradation in accuracy. Several more limited experiments have also been reported: a pipelined scoring co-processor for reduced power consumption [6], and several low-power multi-processor architectures [7,8]. However, these either show very limited (Amdahl's-law-limited) speedup [7], or target vocabularies too small to be of relevance for large-scale audio mining [8].

In prior work as part of the Carnegie Mellon In Silico Vox project [9], we have demonstrated several designs and prototypes for hardware-based speech recognition. Of particular note was the recent demonstration of the first single-FPGA implementation of a complete, state-of-the-art recognizer architecture [10]. That design supported live-mode (live microphone input and text output), speaker-independent, connected speech recognition. Although that design's recognition speed and vocabulary (1000 words) were severely constrained by platform resources, the experience provided us with a deeper understanding of the requirements needed to build a high-speed, large-vocabulary system.

The most common architecture for a modern recognizer (Figure 1) comprises three stages: feature extraction, acoustic scoring, and backend search. Likewise, we envision a hardware-based recognizer as comprising a 3-stage pipeline of hardware implementations of each of these essential steps. As we shall show, in the large-vocabulary case, the backend search stage is the most limiting bottleneck for the overall design. Thus, in this paper, we demonstrate a fully functional backend search engine implemented in FPGA form.

Figure 1. Speech recognition decoding flow

To validate our architecture, we implemented it on a Berkeley Emulation Engine 2 (BEE2) FPGA platform [11], and took advantage of its large FPGA resources and high-speed DRAM bandwidth. For performance comparisons, we used the open source CMU Sphinx 3.0 recognizer software [12]. We used a 5K-word vocabulary Wall Street Journal (WSJ) [13] model and benchmark to evaluate performance. We were able to achieve a roughly 10x speedup over real-time, with negligible loss of accuracy when compared to the software reference model. To the best of our knowledge, this is both the most complex and the fastest recognizer search engine ever to be realized in hardware.

The rest of this paper is organized as follows. In Section 2 we briefly discuss the architecture of a modern speech recognizer and the key algorithms used. We discuss in particular our reference architecture, CMU Sphinx 3.0, with a focus on the backend search stage. In Section 3, we profile and analyze the performance of CMU Sphinx 3.0's backend search stage by examining decoding speed. Section 4 describes the datapath and memory optimizations necessary to achieve a fast backend search engine. Section 5 describes a sequence of two hardware designs: a single-FPGA and then a dual-FPGA prototype. Section 6 gives experimental results from these fully functioning backend search engines. Finally, Section 7 offers concluding remarks.

2. SPEECH RECOGNITION OVERVIEW
Automatic speech recognition transforms human speech into text. For simple tasks (unconnected words with pauses, speaker dependent, small vocabularies), recognition is fairly easy, and speech algorithms can easily run in real-time on even a small processor. But for more complex tasks (continuous speech, speaker independent, large vocabularies), recognition with high accuracy requires high-end processors just to achieve acceptable decoding speeds.

To understand how a modern large-vocabulary recognizer works, we describe in this section the well-known open source CMU Sphinx 3.0 recognizer. Architected for high accuracy, Sphinx 3.0 runs relatively slowly compared to some recent speed-optimized recognition systems, but its framework provides a solid reference for our custom hardware design. As with most robust recognizers, Sphinx 3.0's decoding consists of three key stages, as shown in Figure 1.

Once speech is input to a microphone and sampled by an analog-to-digital converter, it is organized into 10 ms blocks called frames. Each frame is converted from a time-domain signal to the frequency domain, where more unique features of speech can be detected and analyzed in the feature extraction stage. From there, a feature vector is generated containing all the acoustic information for the frame, which is passed on to the acoustic scoring stage. Scoring matches the sound "heard" in each frame against a large library of atomic sounds. Mechanically, this is usually done using Gaussian mixture models (GMMs). From these sound probabilities, the backend search stage finds the most probable word sequence by gluing together sounds using their probabilities. In the following subsections we elaborate on the algorithms in the backend search stage.

2.1 Acoustic Model
After each frame of speech has produced sound probabilities, the backend search uses a two-layer model to associate sounds with word sequences [1]. In the first layer, an acoustic model maps sounds to words. The granularity at which speech recognition systems examine sounds is the sub-word unit called the phoneme. Depending on the language, the number of phonemes can differ; English, for example, has approximately 50 unique phonemes. When these phonemes are pronounced, the acoustic realization of each sound, called the phone, varies depending on its context. Recognition systems model these effects by taking into consideration the phones directly preceding and following the phone, commonly referred to as triphones. To more accurately model triphones in speech, sub-phonetic sounds are considered, known as states. In Sphinx 3.0, each triphone is made up of 4 states, 3 of which are associated with sounds, and a 4th state representing an end-of-triphone position, or null state. Because there are only a limited number of sounds that can be produced, the states are clustered into a smaller, more manageable set of atomic sounds called senones [14]. Hierarchically, words are broken up into triphones and triphones are broken up into states (senones), and for each senone, a probability is generated every frame by the acoustic scoring unit.
Figure 2. Word hierarchy

2.2 Hidden Markov Models
Because the task of speech recognition is to find the most likely words from speech features, Hidden Markov Models (HMMs) [15] have been found to be an effective means of organizing speech at the acoustic level. The essential property of HMMs is that, based on observations (speech), we can infer through probability the likelihood that a hidden state (a sub-phonetic sound of a word) has been reached. In Sphinx 3.0, triphones are represented as 4-state HMMs where each non-null state has a self-transition and a transition to the state to its right (Figure 2); this captures the
notion that speech over time can either remain in the same state or continue down the phonetic chain.

2.3 Viterbi Beam Search
To find the most likely word sequence, HMMs are searched using the Viterbi algorithm [16], a time-synchronous dynamic programming technique. For each frame of speech, the Viterbi algorithm updates all state probabilities (scores) with the current observation (senone) probability and the state transition probabilities. At the end of each frame, only the highest probability reaching each successor state is maintained and is considered to be the best path to that state. The best path to successor state j at time t from all preceding states i is summarized in Equation 1:

    P_t(j) = max_i [ P_{t-1}(i) a_{ij} ] b_j(O_t), over all states i that transition to j    (1)

where a_{ij} is the transition probability from state i to state j, b_j is the observation probability of state j, and O_t is the observation at time step t. However, just knowing the final best-path score is not enough to find the most likely word sequence; a history of transitions must be maintained to record when each word was uttered.

By searching through and updating every state of every word in the vocabulary in every frame, the Viterbi algorithm is guaranteed to find the most likely state sequence. However, this search space is vastly too large for even high-end processors to decode large-vocabulary tasks in reasonable time. As a result, pruning of less probable states is used to reduce the search space after each time step; this is commonly referred to as beam pruning. States that remain unpruned after this stage are referred to as active. Although beam search may reduce recognition accuracy, the time it saves is significant.
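The fragment below is a minimal software sketch of one time step of Equation 1 for a single left-to-right 4-state HMM. It works with log-probabilities, so the multiply in Equation 1 becomes an add and each state update reduces to an add-compare-select (the same simplification the hardware Viterbi stage in Section 4 relies on). The transition and senone scores here are made-up values for illustration only.

    NEG_INF = float("-inf")  # log(0): an unreachable / inactive state

    def viterbi_step(prev, self_tr, next_tr, senone_scores):
        """One frame of Eq. 1 for a left-to-right HMM in the log domain.

        prev          : best log-scores P_{t-1}(i) for the 3 emitting states
        self_tr       : log a_ii for each emitting state (self loops)
        next_tr       : log a_i,i+1 for each emitting state (advance right)
        senone_scores : log b_j(O_t) for each emitting state's senone

        Returns the updated emitting-state scores and the score reaching the
        non-emitting exit (null) state, which later drives within-word and
        cross-word transitions.
        """
        cur = [NEG_INF] * 3
        for j in range(3):
            stay = prev[j] + self_tr[j]                                    # remain in state j
            move = prev[j - 1] + next_tr[j - 1] if j > 0 else NEG_INF      # advance from j-1
            cur[j] = max(stay, move) + senone_scores[j]                    # add-compare-select, then b_j(O_t)
        exit_score = prev[2] + next_tr[2]                                  # into the null exit state
        return cur, exit_score

    # Toy numbers only (log domain, so all values are <= 0).
    prev = [-10.0, -14.0, NEG_INF]
    cur, exit_score = viterbi_step(prev,
                                   self_tr=[-0.5, -0.5, -0.5],
                                   next_tr=[-1.0, -1.0, -1.0],
                                   senone_scores=[-2.0, -3.0, -4.0])
    print(cur, exit_score)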

2.4 Language Model
While the acoustic model can successfully capture within-triphone (state-to-state) transitions and within-word (triphone-to-triphone) transitions, it cannot accurately capture cross-word (word-to-word) transitions, where word ambiguity can cause confusion. Examples of this include homophones (like "too" and "two", or "flew" and "flu") and word boundaries. Homophones cannot be differentiated by recognizers without grammar knowledge. In the same manner, incorrect labeling of word boundaries can cause words like "website" to be misconstrued as "web sight". To handle this, most modern recognizers add another layer on top of the acoustic model, called the language model. Sphinx 3.0 uses more-likely-to-be-stated words and phrases (word sequences) to direct recognition. This is referred to as an n-gram language model. Assigning probabilities according to how often words and phrases appear during training, the n-gram model considers words, word pairs, and word triples, referred to as unigrams, bigrams and trigrams, respectively. Because the memory requirements increase exponentially with the number of words, recognizers typically stop at trigrams. In practice, the n-gram language model is applied after a cross-word transition is made, during which the completed word uses the n-gram probabilities to determine which new words to activate. For example, if an active word "are" makes a cross-word transition and is preceded by the word "how", a list of trigrams will appear with words that are more likely to follow, such as "you" and "they". With a preset probability associated with each n-gram, recognizers are able to achieve higher accuracy with the extra grammar information.
2.5 Backend Search
The backend search stage can be divided into three distinct phases for a given frame of speech, based on the algorithms discussed above. The first phase handles Viterbi calculations for active HMM states, where senone scores from acoustic scoring are used to update active states or transition to new states within a phone (within-triphone transitions). The second phase includes within-word transitions (activations of new triphones) and beam pruning of improbable triphones. Finally, the last phase uses the n-gram model to spawn new words when words are completed and a cross-word transition is applied. In the next section, we present a deep examination of the computational requirements of backend search by profiling CMU Sphinx 3.0.
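To show how the three phases fit together each frame, the following is a deliberately simplified software model of one frame of backend search. The dictionary fields, the beam value, and the helper callables (senone_update, lm_candidates) are our own placeholders for the operations described in Sections 2.2-2.4, not Sphinx 3.0 routines.

    NEG_INF = float("-inf")

    def decode_frame(active, senone_update, lm_candidates, beam):
        """One frame of backend search, organized as the three phases above.

        active        : list of dicts, one per active triphone HMM, carrying
                        'word', 'phone_pos', 'num_phones', 'scores' (3 emitting
                        states, log domain) and 'exit' (score at the null state)
        senone_update : callable applying a Viterbi step (Eq. 1) to one entry
        lm_candidates : callable mapping a completed word to new-word candidates
        beam          : pruning width relative to the best score of the frame
        """
        for h in active:                       # Phase 1: within-triphone Viterbi update
            senone_update(h)

        best = max(max(h["scores"]) for h in active)
        survivors, patch_list = [], []
        for h in active:
            if max(h["scores"]) < best - beam: # Phase 2: beam pruning
                continue
            survivors.append(h)
            if h["exit"] > NEG_INF and h["phone_pos"] + 1 < h["num_phones"]:
                # Within-word transition: seed the next triphone from the exit score.
                survivors.append({"word": h["word"], "phone_pos": h["phone_pos"] + 1,
                                  "num_phones": h["num_phones"],
                                  "scores": [h["exit"], NEG_INF, NEG_INF], "exit": NEG_INF})
            if h["exit"] > NEG_INF and h["phone_pos"] + 1 == h["num_phones"]:
                # Phase 3: a completed word consults the n-gram language model.
                patch_list.extend(lm_candidates(h["word"]))
        return survivors, patch_list

    # Toy example: one active entry for the last phone of "two".
    def toy_update(h):
        h["scores"] = [s - 2.0 for s in h["scores"]]   # stand-in for Eq. 1
        h["exit"] = h["scores"][2] - 1.0

    active = [{"word": "two", "phone_pos": 1, "num_phones": 2,
               "scores": [-12.0, -15.0, -18.0], "exit": NEG_INF}]
    print(decode_frame(active, toy_update, lambda w: ["you", "we"], beam=100.0))

This per-frame structure maps directly onto the hardware pipeline stages introduced in Section 4.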

Figure 3. Profiles of the 1K RM, 5K WSJ, and 60K HUB4 models

3. PROFILING CMU SPHINX 3.0
In [10], we profiled CMU Sphinx 3.0 using the processor simulator SimpleScalar and determined that the memory hierarchy of conventional processors is ill-suited for speech recognition. We also analyzed its decoding patterns by examining the algorithm implementation and monitoring CPU processing time while running a medium-sized 1K-word speech benchmark. We revisit CMU Sphinx 3.0 here to study how different-sized vocabulary models affect the processing speed of each decoding stage. Based on this breakdown and our knowledge of the algorithms, we locate the most timing-critical portions of decoding and address them with higher priority when designing the hardware. For these experiments we used three models and benchmarks: a 1K-word Resource Management (RM) task, a 5K-word Wall Street Journal (WSJ) task, and a 60K-word Broadcast News (HUB4) task [13]. The profile for each model is shown in Figure 3.

Each decoding stage has its own distinct arithmetic and memory requirements. Feature extraction comprises signal processing algorithms with little computation and few memory accesses. Acoustic scoring involves computing a set of Gaussian mixtures for each senone and demands moderate computation and memory bandwidth. Both feature extraction and acoustic scoring computation times are fixed from one frame to the next. Backend search, on the other hand, has a dynamic number of active states that are each processed through one time step of the Viterbi algorithm every frame. Of the active triphones, some will make within-word transitions, activating new triphones, and some will
make cross-word transitions, triggering the n-gram language model, where a list of word candidates is searched for word activation. The computation is moderate, the memory bandwidth is large, and the memory access pattern is noncontiguous. The processing time depends largely on the number of active states and the sizes of the language model word candidate lists searched each frame.

Figure 4. % of time spent on each stage of decoding running CMU Sphinx 3.0

We ran the three benchmarks on a 2.8 GHz Xeon processor with 2GB of memory, monitoring the decoding speed of each stage (Figure 4). The experiment revealed that the compute time for feature extraction constituted less than a percent of overall processing, while time spent on acoustic scoring grew linearly with the total number of senones. Backend search dominated computation time with larger vocabulary sizes due to the high number of active states and language model word candidates. Based on the profiling, we set our sights on speeding up larger-vocabulary tasks (at least 5K words), since smaller benchmarks already run easily in real-time on general-purpose CPUs. We focus on the backend search stage because of its long processing time and frame-to-frame computation variability. Hardware is also appealing for backend search: with its large, irregular memory access patterns, the stage can benefit significantly from a custom datapath and a custom memory hierarchy and organization. The other stages, while important, have clearer paths to high performance with custom hardware: feature extraction needs few computational resources, and acoustic scoring (GMM) is inherently parallelizable as mentioned in [10], so long as it has adequate memory bandwidth.

Figure 5. % of time spent in each backend search stage

We did a more detailed study of Sphinx 3.0's backend search by examining the decoding speed and breaking down the computation time into sub-stages: Viterbi, Transition/Pruning, and Language model (Figure 5). The results show that the Language model and Viterbi stages consume the most time, and for larger vocabulary sizes the time spent in the Language model proportionally increases. When designing the hardware architecture, we pay particular attention to speeding up these two stages.

4. CUSTOM HARDWARE BACKEND SEARCH ENGINE
Our goal for our fast speech recognition backend search engine is to achieve a considerable speedup over software systems running on high-end processors, while using fewer resources and running at a fraction of the clock speed. Our target platform is the BEE2 FPGA board, which contains 5 Xilinx Virtex-II Pro XC2VP70 FPGAs and 4 DDR2 DRAM channels per FPGA. In this section, we go over our approach to developing our architecture and introduce the custom datapath and memory system developed for the BEE2 platform. We use decoding speed to measure speedup and word error rate (WER) to measure accuracy as metrics to evaluate speech recognition performance.

4.1 Approach
Because software algorithms do not always map well to hardware, several modifications were made to the CMU Sphinx 3.0 algorithms to speed up our hardware implementation, as discussed in [10]. No further functional modifications have been made to the algorithms. To account for the changes, we generated a software backend search engine which matches our custom hardware design algorithmically, so that verification of word accuracy and design correctness could be done quickly and thoroughly. Afterwards, we extracted memory structures from the software recognizer to generate the DRAM memory image for our FPGA design. Then, we designed the baseline custom hardware architecture in Verilog HDL and used Mentor Graphics ModelSim to functionally verify correctness and provide cycle count information. From there, we synthesized the design in Xilinx ISE/EDK and placed it on the BEE2 for validation. Once verified to be correct, we optimized for speed to fully utilize a single FPGA on the board. To further increase speedup, we split our design across two FPGAs on the board to get access to more area. In doing so, we had to design a sophisticated interface to reduce communication overhead between the FPGAs while still maintaining correct functionality.

4.2 Architecture
Our architecture evolved from a straightforward serial implementation of the algorithms to an extremely parallel and pipelined design with multiple HMM entries (4-state HMMs) in flight at any given time. In this section we give an overview of the development of the high-performance speech recognition backend search engine.

We began our pursuit of a high-performance engine by creating a baseline datapath to lay the foundation upon which our optimizations would be built. The baseline datapath is broken up into 4 stages: Fetch, Viterbi, Transition/Writeback/Prune, and Language model. The roles of each stage are as follows (Figure 6):

Figure 6. Baseline backend search datapath

Fetch: Much like the instruction fetch unit of a CPU, the Fetch stage retrieves active HMM entries (triphones) from memory for the current frame and passes them to the Viterbi stage to be updated. Fetch has three methods of scheduling an HMM entry: 1) retrieve an entry from the Active HMM Queue, 2) create an entry if a word is newly entered by a cross-word transition (stored in the Patch List), or 3) update an entry if a word exists in the Patch List and an entry for the first phone of the same word exists in the Active HMM Queue. Each HMM entry maintains 256 bits of data, including fields such as wordID, triphone, previous state scores, history, and phone position. In our design we have 5 types of HMM entries with different fields, based on phone position and word length, to optimize decoding. Each HMM entry is passed to the Viterbi stage in word and phone-position order. Maintaining this order is necessary to correctly update states for within-word transitions.

Viterbi: For each of the 4 states in the HMM entry, the Viterbi algorithm updates the previous state score with the senone scores and transition probabilities in parallel. To reduce computation cost, probabilities are stored in log form to replace multiplications with additions. As a result, Add-Compare-Select (ACS) units can be used to determine the best path for each state.

Transition/Prune/Writeback: Beam pruning removes HMM entries when every score in the triphone is less than a set threshold (determined by the best score of the previous frame). HMM entries that remain active after pruning are written back to the Active HMM Queue. When the last state of an active HMM entry (that is not the last phone of the word) passes a transition threshold, a within-word transition is made to the first state of the following triphone's HMM entry, causing two entries to be written back to the Active HMM Queue. If the following HMM entry already exists, that entry is updated when it passes through the stage.

Language model: Once the last state of the last phone of a word passes a word threshold, a set of modules is called to handle n-gram lookup to find the most likely word candidates to spawn. When word candidates are judged probable, they become active and are stored into the Patch List through a module called Word Enter, for the next frame to retrieve. A bit-vector list is stored in BRAM to quickly determine whether words have already been activated.
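To make the Fetch stage's unit of work concrete, the sketch below packs a hypothetical 256-bit Active HMM Queue entry from the fields named above. The paper does not give the exact field widths, ordering, or the contents of the 5 entry types, so the layout here is purely an illustrative assumption.

    # Hypothetical 256-bit HMM entry layout; widths are assumptions for
    # illustration, not the widths used in the actual Verilog design.
    FIELDS = [            # (name, bit width)
        ("word_id",     16),   # index into the 5K-word vocabulary
        ("triphone_id", 16),   # which context-dependent phone this entry is
        ("phone_pos",    8),   # position of the triphone within the word
        ("history",     32),   # back-pointer into the word history / lattice
        ("score_s0",    32),   # previous log-scores of the 3 emitting states
        ("score_s1",    32),
        ("score_s2",    32),
        ("score_exit",  32),   # score that reached the null exit state
        ("flags",        8),   # entry type (the design uses 5 entry types)
        ("pad",         48),   # padding up to the 256-bit DRAM word
    ]
    assert sum(w for _, w in FIELDS) == 256

    def pack(entry: dict) -> int:
        """Pack a field dict into a single 256-bit integer, LSB-first."""
        word, shift = 0, 0
        for name, width in FIELDS:
            word |= (entry.get(name, 0) & ((1 << width) - 1)) << shift
            shift += width
        return word

    def unpack(word: int) -> dict:
        """Inverse of pack()."""
        out, shift = {}, 0
        for name, width in FIELDS:
            out[name] = (word >> shift) & ((1 << width) - 1)
            shift += width
        return out

    entry = {"word_id": 1234, "triphone_id": 87, "phone_pos": 2, "score_s0": 0xFFFFF000}
    assert unpack(pack(entry))["word_id"] == 1234

Whatever the real field split, one entry per 256-bit DRAM word is what lets the Fetch stage stream the Active HMM Queue at the memory interface's full width.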
One of the most challenging aspects of the backend search engine is how to effectively partition where its large data set should be stored, whether in BRAM or DRAM. Because different tasks have memory structures of widely varying sizes, we opted to store everything in DRAM unless it was guaranteed that the data structure (such as the senone scores) could be completely stored in BRAM regardless of task size. Because memory plays such a critical role in the performance of the stage, we turn to caches and buffers to speed the design up in our optimized implementation.

After using our baseline datapath to estimate the area and timing requirements for our backend search engine to run on a BEE2 FPGA, we developed a performance-driven design running on a single FPGA with several key optimizations, including pipelining, parallelism, prefetching, and buffering.

Figure 7. Pipelined backend search engine example

Pipelining: With extra area, we pipelined our design across the 4 stages to better utilize the hardware resources and speed up processing. As seen in Figure 7, when the first HMM entry is fetched and sent to the Viterbi stage, another HMM entry can be fetched at the same time. Because the Language model is not accessed often, HMMs can continue to execute so long as the Language model is not needed. If it is busy and another HMM requires the stage, instead of stalling, a circular queue is used to buffer completed words so HMMs can continue to be decoded. When the Language model becomes free, the buffer is accessed to handle the next completed word.

Parallelism: Since each active HMM entry passes through the same decoding path, the datapath can be replicated to further increase performance. The only restriction is that the HMMs for a word must stay together in the same partition so transitions are handled correctly. Unfortunately, the area required to completely duplicate the design is too large for 1 FPGA. Instead, we parallelized the slower stages, including Fetch and parts of the Language model. We also replicate the Transition/Prune/Writeback stage to maintain correctness. With the Language model consuming a significant amount of decoding time, 4 Word Enter modules are created to quickly compute and compare word candidates for activation. Functionally, each Fetch unit and Transition/Prune/Writeback unit is responsible for a subset of the vocabulary, while the Viterbi unit is shared among all words. The design synchronizes whenever a frame of speech is completed. Scheduling for the Language model is also necessary, since potentially two words can complete from each Transition/Prune/Writeback unit during a cycle.

Prefetching: The backend search stage's performance suffers from the large quantity of data that has to be loaded from DRAM each frame. To speed up memory requests, we take advantage of knowing that the Active HMM Queue is stored contiguously in DRAM to prefetch the entries and store them locally in BRAM for the Fetch stage. Because the DRAM services already-open rows quickly and can handle pipelined
requests, the active HMM entries for a frame can be quickly serviced by the DRAM all at once.

Buffering: In our design we use buffering to reduce stalls created by memory and by busy modules. For example, in the Language model stage, n-gram word candidates from DRAM are continually fetched and buffered while Word Enter modules determine whether they should be stored into the Patch List or not. Since the word candidates are stored contiguously in memory, retrieving them and storing them in BRAM helps reduce memory latency.

The complete optimized single FPGA design is shown in Figure 8, displaying the parallelization and pipelining of each stage. In the diagram, Fetch 0 and Transition/Prune/Writeback 0 take care of words starting with 'A' through 'L', and the other partition takes care of words that start with 'M' through 'Z'.

Figure 8. Optimized (single FPGA) datapath

With our optimized datapath, pipelining and parallelism can cause multiple outstanding DRAM requests from different sources. To prevent starvation, we schedule by priority and first-come-first-served. Sequential consistency is maintained for each frame because each active HMM entry is written only once per frame, and while a word may be entered multiple times by different cross-word transitions, only the best score is kept for each frame.

To reduce the memory latency of our system we maintain 3 types of read-only BRAM caches. Though we have previously shown poor cache performance for software speech recognition systems running on a processor [10], we selectively cache only data that is reused often and can mostly fit in the cache. Because many of the active HMM entries in neighboring frames are the same, several data structures are used over and over again, including senone indexes, triphone lookups, and Language model word candidates. The 3 caches are:

Fetch cache: Whenever an active word is read from the Patch List during the Fetch stage, a pointer lookup is necessary to determine the triphone for the first state of the active HMM entry, whether it is a new HMM or an update to an already active one. The Fetch cache stores the series of triphone lookups each frame, in the hope that in the next frame the same data will be accessed again.

Viterbi cache: To speed up the Viterbi stage, we cache accesses to a map table used to translate triphones to senone indexes. While some of the active HMM entries already carry the senone indexes as one of their fields, there are still quite a few accesses to the table. With the senone scores stored in BRAM and the map table data hitting often in the cache, the Viterbi stage typically does not need to go to DRAM.

Language model cache: In the Language model, the words that make cross-word transitions do not change much over time; therefore a cache holding the possible word candidates is extremely beneficial, so long as the cache is large enough to cover the data set. The working set is rarely larger than the cache, so the module does a good job of reducing memory access times.
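The following is a minimal software model of the kind of read-only, direct-mapped cache these three lookups use, with one entry per line to match the 256-bit DRAM data width mentioned in Section 6.1; the depth and the backing-store callback are illustrative parameters, not the configuration of the actual BRAM caches.

    class ReadOnlyCache:
        """Direct-mapped, read-only cache model with one 256-bit entry per line.

        depth        : number of lines (i.e., what fits in a handful of BRAMs)
        backing_read : callable(addr) -> value, standing in for a DRAM read
        """
        def __init__(self, depth, backing_read):
            self.depth = depth
            self.backing_read = backing_read
            self.tags = [None] * depth
            self.data = [0] * depth
            self.hits = self.misses = 0

        def read(self, addr):
            index = addr % self.depth
            if self.tags[index] == addr:          # hit: serve from BRAM
                self.hits += 1
                return self.data[index]
            self.misses += 1                      # miss: fetch from DRAM, fill the line
            value = self.backing_read(addr)
            self.tags[index], self.data[index] = addr, value
            return value

        def hit_rate(self):
            total = self.hits + self.misses
            return self.hits / total if total else 0.0

    # Toy usage: repeated triphone-to-senone map lookups hit after the first miss.
    cache = ReadOnlyCache(depth=2048, backing_read=lambda addr: addr * 3)
    for addr in [10, 10, 42, 10, 42]:
        cache.read(addr)
    print(round(cache.hit_rate(), 2))   # 0.6 for this access pattern

Because the caches are read-only and rebuilt each task, there is no write-back or coherence logic to complicate the hardware.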
While our single FPGA implementation can achieve a considerable speedup, further optimizations can be made to increase performance with more area. With the flexibility of the BEE2, we have the chance to create an even higher performance design by expanding our datapath across two FPGAs.

We decided the best way to achieve the highest performance was to replicate the single FPGA design onto the second FPGA, such that one FPGA is responsible for creating, updating, and pruning HMMs for a subset of the vocabulary while the other is responsible for the rest of the words. We use one FPGA as the master, which dictates when events should occur, and the other FPGA as a slave, which sends and receives data according to the master FPGA's commands (Figure 9). With board-level communication slow and bitwidth-restricted, we tried to design the system with as little communication across the board as possible, while achieving high performance.

Figure 9. Dual FPGA Datapath

Master FPGA: The master FPGA has three major roles: 1) maintain active HMM entries for words that start with 'A' through 'L', 2) determine and send information to the slave regarding which completed word accesses the Language model for both FPGAs, and 3) announce when a frame has finished decoding. Active HMM entry updates include not only Viterbi but also the n-gram Language model, where only word candidates that start with 'A' through 'L' are searched. The master also stores the completed word list (Word Lattice), which is back-traced after decoding completes to find the most likely word sequence.

Slave FPGA: The slave FPGA has five major roles: 1) maintain active HMM entries for words that start with 'M' through 'Z', 2) send the necessary information to the master when a word completes, 3) start the Language model and accept cross-word transition information when the master sends it, 4) inform the master FPGA when it has completed decoding for the frame, and 5) update information received from the master regarding the end of a frame (such as the pruning threshold).
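The division of labor between the two FPGAs can be summarized as a small set of frame-synchronized messages. The message names, fields, and partition helper below are our own paraphrase of the roles listed above, not the actual bit-level protocol carried over the BEE2 inter-chip links.

    from enum import Enum, auto
    from dataclasses import dataclass

    class MsgType(Enum):
        """Hypothetical message vocabulary between master ('A'-'L') and slave ('M'-'Z')."""
        WORD_COMPLETED   = auto()  # slave -> master: a word finished, with its score/history
        START_LM         = auto()  # master -> slave: which completed word drives the Language model
        CROSS_WORD_ENTRY = auto()  # either direction: activate a new word in the Patch List
        SLAVE_FRAME_DONE = auto()  # slave -> master: this frame's HMMs are all processed
        FRAME_END        = auto()  # master -> slave: frame finished; carries the pruning threshold

    @dataclass
    class Message:
        kind: MsgType
        payload: dict

    def owner(word: str) -> str:
        """Vocabulary partition used by the dual-FPGA design: words starting
        with 'A'-'L' belong to the master, the rest to the slave."""
        return "master" if word[:1].upper() <= "L" else "slave"

    # Example frame-end handshake: the slave reports done, then the master
    # broadcasts the pruning threshold derived from the best score of the frame.
    inbox = [Message(MsgType.SLAVE_FRAME_DONE, {"frame": 412}),
             Message(MsgType.FRAME_END, {"frame": 412, "prune_threshold": -3150})]
    for msg in inbox:
        print(msg.kind.name, msg.payload)
    print(owner("apple"), owner("zero"))

Keeping the exchanged state down to completed words and end-of-frame bookkeeping is what makes the slow, narrow board-level links tolerable.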

Much like the datapath, the memory is almost an exact replica of the single FPGA memory, and the complete design uses twice the BRAM and two DRAM channels.

5. BACKEND SEARCH ENGINE IMPLEMENTATION
We discuss several detailed system implementation issues concerning how we mapped our architecture onto the BEE2 platform.

5.1 Implementation Framework
The BEE2 platform has large FPGA resources and can support high memory bandwidth. We use it as a proof of concept to demonstrate our high performance architecture.

The BEE2 platform has 5 FPGAs: 1 control and 4 user. The control FPGA has direct access to peripherals such as Ethernet, RS232, and CompactFlash, and direct links to all user FPGAs. Each user FPGA has direct links to 2 neighboring user FPGAs as well as to the control FPGA. In our setup, we load the FPGA bitstreams directly from CompactFlash. One of the PowerPC cores on the control FPGA is used to support pre-decoding memory transfers (loading models into DRAM) and I/O (displaying data through RS232). While the processor is used to support the backend search engine, the engine itself is implemented entirely in custom hardware.

For our single FPGA implementation, we set a target of 100MHz for the clock rate. We designed the backend search engine to run on the control FPGA to avoid the use of board-level communication. Our testing method consists of loading senone scores for a frame of speech, starting the backend search engine, and waiting until the stage completes before loading the scores for the next frame of speech, until all the frames are completed. To verify the functionality of each frame of speech, we set up an interface from the PowerPC to read from DRAM and compare the scores for each active HMM entry. As a sanity check we also keep track of the number of active HMM entries written for each frame. Once all the frames have been decoded, the word sequence is displayed through RS232.

We set the dual FPGA implementation to also run at 100MHz. Unlike the single FPGA system, we placed our design on two user FPGAs in anticipation of the design being used in a complete speech recognition system. With one of the control FPGA's PowerPC cores, we load the user FPGAs' bitstreams through SelectMAP (a low-bandwidth communication channel). For testing, we use the control FPGA to load senone scores and start the decoding of a speech frame on both user FPGAs. Once all the frames are completed, the control FPGA requests the word sequence from the master user FPGA and displays it through RS232. We also created an interface through the control FPGA that allows the PowerPC to read DRAM data from both user FPGAs to verify the active HMM entries are being decoded correctly.

5.2 DRAM
On the BEE2, each FPGA is connected to 4 DDR2 SDRAM channels with 64-bit-wide data. But more telling is the speed at which we can retrieve data; each DRAM channel has a maximum bandwidth of 25.6 Gb/s, or 3.2GB/s (the DDR2 controller runs at 200 MHz). However, to achieve such data rates, the data must be stored at consecutive addresses in DRAM. Since we plan to run our design at 100 MHz, we used the BEE2 library's asynchronous DRAM interface, which allows our design to interface with the faster DRAM controller. The asynchronous DRAM interface can transfer at most 256 bits of data each cycle to fully utilize the channel. The interface has pipelined requests, such that there can be multiple outstanding requests. For data that is accessed randomly, maximally pipelined reads are handled roughly once every 5 100MHz cycles (for a total of 640MB/s). The latency for a single read request is 18-19 cycles.
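These bandwidth figures follow directly from the bus widths and clock rates quoted above; the short calculation below reproduces them (the numbers come from the text, the arithmetic is ours).

    # Peak DDR2 channel bandwidth: 64-bit data bus, 200 MHz controller, 2 transfers/cycle.
    peak_bits_per_s = 64 * 200e6 * 2
    print(peak_bits_per_s / 1e9, "Gb/s")          # 25.6 Gb/s
    print(peak_bits_per_s / 8 / 1e9, "GB/s")      # 3.2 GB/s

    # Random-access reads through the asynchronous interface: one 256-bit beat
    # roughly every 5 cycles of the 100 MHz design clock.
    random_bytes_per_s = (256 / 8) * (100e6 / 5)
    print(random_bytes_per_s / 1e6, "MB/s")       # 640 MB/s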
Before decoding begins, the recognizer requires an initialization period to move static model data (all the word information for the HMMs and the probabilities for the n-gram word candidates) from the CompactFlash card to the DRAM of each FPGA (Figure 10). The data is transferred using a PowerPC core and through our recognizer to DRAM. With the dual FPGA implementation, the data is transferred from the control FPGA to the user FPGAs' DRAM through board-level communication. Once the data is transmitted, control is passed to the user FPGAs to access their own DRAM.

Figure 10. Accessing DRAM: 1) CompactFlash loads data into DRAM through the PowerPC and our backend search core; 2) backend search DRAM access path, with no use of the PLB

5.3 Board-level communication
To communicate between FPGAs, we used inter-chip links which use a 200MHz clock to transfer data with no support from the PowerPC. For our dual FPGA design, we use the inter-chip interface provided in the BEE2 library, which uses 138 pins, 69 bits each way, between user FPGAs. To load static data and do testing we use the control FPGA links, which use 64 pins, 32 bits each way. The inter-chip links allow pipelined data transfers, so the maximum bandwidth between the user FPGAs is 3.5GB/s, and between the control FPGA and a user FPGA it is 800MB/s in each direction. We were more concerned about latency, so we made sure it was not going to be a limiting factor by testing the inter-chip modules and determining that the latency was 5 200MHz cycles from one FPGA to the other. To interface between the different clock domains we used an asynchronous FIFO to pass data.

6. EXPERIMENTAL RESULTS
In this section we discuss our simulation results and the area and memory requirements for our single and dual FPGA implementations of a fast speech recognition backend search engine.

6.1 Hardware Simulator Results
Before moving to the BEE2, we simulated our design thoroughly across the 5K WSJ task (330 utterances, for a total of 40 minutes of speech) in ModelSim to estimate decoding speed and decide the optimal number of BRAMs to use for caches and buffers. We also looked at the efficiency of our datapath and examined DRAM utilization.

With BRAMs limited to fixed configurations, we analyzed how the sizes of the buffers and caches affect the performance of our single FPGA design. Each entry in the Active HMM Queue Buffer is 256 bits, requiring a baseline cost of 8 BRAMs. For depths up to 512, the number of BRAMs needed remains the same. While a deeper buffer can provide better performance, the speedup degrades with buffer depth and stops once all the active HMM entries fit in the buffer. Through simulation we find that the ideal size is 8 BRAMs.
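The 8-BRAM figure follows from the 256-bit entry width and the 18 Kb block RAMs on the Virtex-II Pro, as the short check below shows; the 18 Kb block size and its 512 x 36 configuration are device facts we are assuming here, while the 256-bit entry and 512-entry depth come from the text.

    import math

    ENTRY_BITS = 256          # one Active HMM Queue Buffer entry
    DEPTH = 512               # deepest configuration that costs no extra BRAM
    BRAM_BITS = 18 * 1024     # one Virtex-II Pro block RAM (18 Kb)

    total_bits = ENTRY_BITS * DEPTH
    print(total_bits, "bits =", total_bits / 1024, "Kb")          # 131072 bits = 128 Kb
    print("BRAMs needed:", math.ceil(total_bits / BRAM_BITS))     # 8

    # The 256-bit width alone also demands 8 BRAMs: each block can be configured
    # 512 deep x 36 bits wide, so 8 side by side cover the 256-bit entry, which is
    # why depths up to 512 cost no additional BRAM.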
Figure 11. Different Fetch and Viterbi cache configuration performance for the single FPGA design

In similar fashion, we compare the caches in our design and examine the hit rate and BRAM size to determine the best configuration. With the DRAM data width being 256 bits, we set the cache line size to conveniently be the same size. For the Fetch and Viterbi caches, we settled on using 8 BRAMs each, based on the hit rate (Figure 11). The Language model cache requires a much greater depth to achieve an acceptable hit rate because of the large number of word candidates that get compared. We settled on a depth of 2048, or 29 BRAMs, for a hit rate of 53%.

With our cache-optimized design, the complete single FPGA backend search engine can decode roughly 6 times faster than real-time, processing 16.8 million HMM states per second on the 5K WSJ task. When compared to other software recognizers with similar accuracy, our design can decode roughly 4 to 10 times faster, while running at an order of magnitude slower clock speed (Table 1). It is interesting to note that our accuracy is among the best of the group as well.

Table 1. Comparison of software recognizers and our single and dual FPGA backend search engines

  Recognizer Engine     Word Error Rate (%)   Clock (GHz)   Speedup Over Real Time
  Sphinx 3.0            6.707                 2.8           0.59
  Sphinx 3.3            7.323                 1.02          0.74
  Sphinx 4              6.968                 2.2           1.6
  Single FPGA design    6.725                 0.1           6
  Dual FPGA design      6.725                 0.1           10

Profiling our design, we examined the efficiency of our pipelined design by looking at how often one stage waits for the next stage (Figure 12). The Fetch stage waits the longest for Viterbi to become free, since 1 Viterbi unit serves 2 Fetch units. The Viterbi stage waits on the Transition/Prune/Writeback stage when a significant number of active HMM entries need to be written back to DRAM. The Transition/Prune/Writeback stage does not have to wait long for the Language model because of the completed-word buffer, which absorbs the stalls. We note that while another Viterbi unit would provide better performance, the area of one FPGA was not large enough to support it and still meet the 100MHz clock constraint.

The DRAM bandwidth requirement for our design is ~93MB/s for real-time processing, with ~66% reads and ~34% writes. Our design, which runs 6 times faster than real-time, therefore consumes 558MB/s. We examined whether there is a memory bottleneck in the system by reducing the DRAM latency to 1 cycle and by increasing the number of DRAM channels to 2. From simulation, decreasing the latency reduced decoding time by 15%, and increasing the number of DRAM channels to 2 reduced it by 6%.

Figure 12. % of overall decoding cycles spent stalling for each pipeline stage

With the same buffer and cache experiments as the single FPGA design, we set each of the Fetch and Viterbi caches to 8 BRAMs (Figure 13) and the Language model cache to 29 BRAMs in our dual FPGA design. Because each cache is now responsible for only a subset of the vocabulary, the hit rates increase over the single FPGA implementation. The Fetch caches increase from 93% to 95-97%, the Viterbi caches increase from 85% to 86-89%, and the Language model cache increases from 53% to 58%.

Figure 13. Different Fetch and Viterbi cache configuration performance for the dual FPGA design

The complete dual FPGA design can decode at about 10 times faster than real-time, or 28 million states per second. Comparing against the software recognizers, the performance improvement is 6-17 fold, with our design running at 100 MHz.

With a faster implementation, the dual FPGA design spends more of its decoding cycles stalling than the single FPGA design, although the relative pipeline stalls across the stages remain fairly consistent (Figure 12). The discrepancy between the stall cycles for the two partitions of the dual FPGA design is most likely due to the decoding pattern of the words that each partition is responsible for.

Table 2. Resource utilization breakdown by module for the single FPGA design

  Module                        Slices (% of total)   4-input LUTs    # of BRAM
  Fetch                         4831 (15%)            5347 (8%)       38
  Viterbi                       927 (3%)              1674 (2%)       54
  Trans/Prune/Wb                3048 (9%)             5030 (8%)       2
  Language Model                9455 (29%)            13790 (21%)     84
  DRAM                          2655 (8%)             2987 (4%)       14
  Other (PPC, CF, RS232, ...)   5057 (15%)            9403 (14%)      49

The DRAM bandwidth requirement is ~50MB/s for the first channel and 45MB/s for the second to achieve real-time processing, with ~66% reads and ~34% writes. To achieve 10 times faster than real-time, the design consumes 950MB/s. Through simulation, when we reduced the DRAM latency to 1 cycle, decoding time was reduced by 17%.

6.2 Fast Backend Search Engines on BEE2
Once we verified our system functionally in ModelSim, we synthesized our Verilog design in ISE 9.1i, extracted a gate-level design, and then did a final round of simulation. When we were confident of correctness, we placed the designs onto the BEE2 board and ran the complete benchmark. The single FPGA design (including the DRAM controller, asynchronous interface, CompactFlash, and other peripherals) uses 78% of the overall FPGA slices (25973/33088), 73% of the overall 5.8 Mb of BRAM (241/328 blocks), and 88MB of DRAM. The breakdown by module, derived from Xilinx ISE, can be seen in Table 2. The design runs successfully at 6 times faster than real-time with a 100 MHz clock.

The dual FPGA implementation uses 2 FPGAs and 2 DRAM channels. The master FPGA uses 85% of the overall slices, 69% of the overall BRAM, and 88MB of DRAM. The slave FPGA uses 75% of the overall slices, 68% of the BRAM, and 88MB of DRAM. The breakdown by module, derived from Xilinx ISE, can be seen in Table 3. The design runs successfully at 10 times faster than real-time with a 100 MHz clock.

Table 3. Resource utilization breakdown by module for our dual FPGA design

  Module                Slices (% of total)   4-input LUTs    # of BRAM
  Master FPGA
  Fetch                 4831 (15%)            5347 (8%)       38
  Viterbi               927 (3%)              1674 (2%)       54
  Trans/Prune/Wb        3048 (9%)             5030 (8%)       2
  Language Model        10379 (31%)           16327 (25%)     84
  DRAM                  2655 (8%)             2987 (4%)       14
  Other                 6288 (19%)            10325 (16%)     32
  Slave FPGA
  Fetch                 4831 (15%)            5347 (8%)       38
  Viterbi               927 (3%)              1674 (2%)       54
  Trans/Prune/Wb        3048 (9%)             5030 (8%)       2
  Language Model        8604 (26%)            12954 (20%)     84
  DRAM                  2655 (8%)             2987 (4%)       14
  Other                 4753 (15%)            7405 (16%)      31

7. CONCLUSION
A growing set of applications in the domain of audio mining require extremely high-speed speech recognition. Our central argument in this paper is that software will always be too slow in servicing this important emerging area, and that hardware is always faster. In support of this assertion, we demonstrated the critical backend search component of a modern recognizer running at 10x real-time on a multi-FPGA BEE2 platform, handling a large 5000-word Wall Street Journal benchmark without loss of accuracy. Our fastest design is 17x faster than the software reference recognizer running on a high-end server, while running at a clock speed roughly 30x slower. To our knowledge, the design is the most sophisticated and fastest speech recognition backend search engine ever to be placed into hardware form.

8. ACKNOWLEDGEMENTS
Special thanks to Tsuhan Chen, Richard M. Stern, and Arthur Chan of CMU for their ongoing support of the project. This research was supported by the Semiconductor Research Corp., the National Science Foundation, and the Focus Center for Circuit & System Solutions (C2S2), one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.

9. REFERENCES
[1] Huang, X., Acero, A. and Hon, H-W., "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall PTR, New Jersey, 2001.
[2] Leavitt, N., "Let's Hear It for Audio Mining", Computer, 35(10):23-25, Oct 2002.
[3] Stolzle, A. et al., "Integrated Circuits for a Real-Time Large-Vocabulary Continuous Speech Recognition System", IEEE Journal of Solid-State Circuits, vol. 26, no. 1, pp 2-11, Jan 1991.
[4] Kavaler, R. et al., "A Dynamic Time Warp Integrated Circuit for a 1000-Word Recognition System", IEEE Journal of Solid-State Circuits, vol. SC-22, no. 1, pp 3-14, Feb 1987.
[5] Cali, L., Lertora, F., Besana, M. and Borgatti, M., "Co-Design Method Enables Speech Recognition SoC", EETimes, Nov 2001, p. 12.
[6] Mathew, B., Davis, A. and Fang, Z., "A Low-power Accelerator for the SPHINX 3 Speech Recognition System", International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp 210-219, ACM Press, 2003.
[7] Krishna, R., Mahlke, S. and Austin, T., "Architectural Optimizations for Low-Power, Real-Time Speech Recognition", International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp 220-231, 2003.
[8] Nedevschi, S., Patra, R. and Brewer, E., "Hardware Speech Recognition on Low-Cost and Low-Power Devices", Proc. Design Automation Conference, 2005.
[9] "The Talking Cure", The Economist, Mar 12, 2005, p. 11.
[10] Lin, E., Yu, K., Rutenbar, R. and Chen, T., "A 1000-Word Vocabulary, Speaker-Independent, Continuous Live-Mode Speech Recognizer Implemented in a Single FPGA", International Symposium on Field-Programmable Gate Arrays (FPGA), Feb 2007.

[11] Chang, C., Wawrzynek, J. and Brodersen, R. W., "BEE2: A High-End Reconfigurable Computing System", IEEE Design and Test of Computers, vol. 22, no. 2, pp 114-125, Mar/Apr 2005.
[12] CMU Sphinx Open Source Speech Recognition Engines, http://cmusphinx.sourceforge.net/html/cmusphinx.php.
[13] Pallett, D., "A Look at NIST's Benchmark ASR Tests: Past, Present, and Future", Proc. 2003 IEEE Workshop on Automatic Speech Recognition and Understanding.
[14] Hwang, M. et al., "Subphonetic Modeling with Markov States - Senone", International Conference on Acoustics, Speech and Signal Processing, pp 33-36, Mar 1992.
[15] Huang, X. D., Ariki, Y. and Jack, M., "Hidden Markov Models for Speech Recognition", Edinburgh University Press, 1990.
[16] Viterbi, A., "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm", IEEE Transactions on Information Theory, vol. 13, pp 260-269, 1967.

