The most common architecture for a modern recognizer (Figure 1) comprises three stages: feature extraction, acoustic scoring, and backend search. Likewise, we envision a hardware-based recognizer as a 3-stage pipeline of hardware implementations of these essential steps. As we shall show, in the large-vocabulary case the backend search stage is the most limiting bottleneck for the overall design. Thus, in this paper, we demonstrate a fully functional backend search engine implemented in FPGA form.

To validate our architecture, we implemented it on a Berkeley Emulation Engine 2 (BEE2) FPGA platform [11], taking advantage of its large FPGA resources and high-speed DRAM bandwidth. For performance comparisons, we used the open source CMU Sphinx 3.0 recognizer software [12]. We used a 5K word vocabulary Wall Street Journal (WSJ) [13] model and benchmark to evaluate performance. We were able to achieve roughly a 10x speedup over real-time, with negligible loss of accuracy compared to the software reference model. To the best of our knowledge, this is both the most complex and the fastest recognizer search engine ever to be realized in hardware.

The rest of this paper is organized as follows. In Section 2 we briefly discuss the architecture of a modern speech recognizer and the key algorithms used, with particular attention to our reference architecture, CMU Sphinx 3.0, and its backend search stage. In Section 3, we profile and analyze the performance of CMU Sphinx 3.0's backend search stage by examining decoding speed. Section 4 describes the datapath and memory optimizations necessary to achieve a fast backend search engine. Section 5 describes a sequence of two hardware designs: a single FPGA and then a dual FPGA prototype. Section 6 gives experimental results from these fully functioning backend search engines. Finally, Section 7 offers concluding remarks.

Figure 1. Speech recognition decoding flow

Once speech is input to a microphone and sampled by an analog to digital converter, it is organized into 10 ms blocks called frames. Each frame is converted from a time domain signal to the frequency domain, where more distinctive features of speech can be detected and analyzed in the feature extraction stage. From there, a feature vector is generated containing all the acoustic information for the frame, which is passed on to the acoustic scoring stage. Scoring matches the sound "heard" in each frame against a large library of atomic sounds, which is usually done using Gaussian mixture models (GMMs). From these sound probabilities, the backend search stage finds the most probable word sequence by gluing together sounds according to their probabilities. In the following subsections we elaborate on the algorithms in the backend search stage.

2.1 Acoustic Model
After each frame of speech has produced sound probabilities, the backend search uses a two-layer model to associate sounds with word sequences [1]. In the first layer, an acoustic model maps sounds to words. Speech recognition systems examine sounds at the granularity of sub-word units called phonemes. The number of phonemes differs by language; English, for example, has approximately 50 unique phonemes. When these phonemes are pronounced, the acoustic realization of each sound, called the phone, varies depending on its context. Recognition systems model these effects by taking into consideration the phones directly preceding and following a phone, a construct commonly referred to as a triphone. To model triphones more accurately, sub-phonetic sounds, known as states, are considered. In Sphinx 3.0, each triphone is made up of 4 states: 3 associated with sounds and a 4th representing an end-of-triphone position, or null state. Because there are only a limited number of sounds that can be produced, the states are clustered into a smaller, more manageable set of atomic sounds called senones [14]. Hierarchically, words are broken up into triphones, and triphones are broken up into states (senones); for each senone, a probability is generated for each frame by the acoustic scoring unit.
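To make the hierarchy concrete, here is a minimal sketch of the word-to-triphone-to-state (senone) decomposition in Python. The class names, senone IDs, and the example pronunciation are our own illustrative assumptions, not Sphinx 3.0's actual data structures; only the 4-state layout (3 emitting states plus a null end state) follows the text above.

```python
# Sketch of the acoustic model hierarchy; names and IDs are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class Triphone:
    """A phone in context: 3 emitting states plus a null end state."""
    base: str              # center phone, e.g. 'D'
    left: str              # preceding phone context
    right: str             # following phone context
    senone_ids: List[int]  # one senone (clustered state) per emitting state

@dataclass
class Word:
    spelling: str
    triphones: List[Triphone]

# Hypothetical entry for the word "DOG" (phonemes D-AO-G):
dog = Word("DOG", [
    Triphone("D",  "SIL", "AO",  senone_ids=[101, 102, 103]),
    Triphone("AO", "D",   "G",   senone_ids=[210, 211, 212]),
    Triphone("G",  "AO",  "SIL", senone_ids=[305, 306, 307]),
])

# Each frame, the acoustic scoring stage produces one score per senone;
# backend search looks scores up by senone ID, e.g. frame_scores[101].
```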
2.2 Hidden Markov Model
The HMM's left-to-right topology captures the notion that speech over time can either remain in the same state or continue down the phonetic chain.

2.3 Viterbi Beam Search
To find the most likely word sequence, the HMMs are searched using the Viterbi algorithm [16], a time-synchronous dynamic programming technique. For each frame of speech, the Viterbi algorithm updates all state probabilities (scores) using the current observation (senone) probability and the state transition probability. At the end of each frame, only the highest probability into each successor state is kept and considered the best path to that state. The best path to successor state j at time t from all preceding states i is summarized in Equation 1:

P_t(j) = \max_i \left[ P_{t-1}(i)\, a_{ij} \right] b_j(O_t) \quad \text{for all states } i \text{ that transition to } j \qquad (1)

where a_{ij} is the transition probability from state i to state j, b_j is the observation probability of state j, and O_t is the observation at time step t. However, knowing the final best-path score alone is not enough to find the most likely word sequence; a history of transitions must also be maintained to recover when each word was uttered.

By searching through and updating every state of the vocabulary in every frame, the Viterbi algorithm is guaranteed to find the most likely state sequence. However, for large vocabulary systems this search space is far too large for even high-end processors to decode in reasonable time. As a result, less probable states are pruned after each time step to reduce the search space; this is commonly referred to as beam pruning. States that survive pruning are referred to as active. Although beam search may reduce recognition accuracy, the time it saves is significant.
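As a concrete illustration of Equation 1 plus beam pruning, the sketch below performs one frame of the recursion in the log domain (sums instead of products, the usual trick to avoid floating-point underflow; the paper does not specify this detail). The data layout and beam width are illustrative assumptions, not the paper's implementation.

```python
NEG_INF = float("-inf")

def viterbi_step(prev_scores, predecessors, obs_logprob):
    """One frame of Equation 1 in the log domain:
    P_t(j) = max_i [P_{t-1}(i) + log a_ij] + log b_j(O_t).
    predecessors maps state j -> list of (i, log_a_ij)."""
    scores, backptr = {}, {}
    for j, preds in predecessors.items():
        best_i, best = None, NEG_INF
        for i, log_a in preds:
            cand = prev_scores.get(i, NEG_INF) + log_a
            if cand > best:
                best_i, best = i, cand
        if best > NEG_INF:
            scores[j] = best + obs_logprob[j]
            backptr[j] = best_i   # transition history, needed later
    return scores, backptr        # to recover when each word was uttered

def beam_prune(scores, beam_width=10.0):
    """Drop states scoring outside the beam; survivors are 'active'."""
    if not scores:
        return scores
    best = max(scores.values())
    return {s: v for s, v in scores.items() if v >= best - beam_width}
```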
2.5 Backend Search
The backend search stage for a given frame of speech can be divided into three distinct phases, based on the algorithms discussed above. The first phase handles Viterbi calculations on active HMM states, where senone scores from acoustic scoring are used to update active states or transition to new states within a phone (within-triphone transitions). The second phase performs within-word transitions (activations of new triphones) and beam pruning of improbable triphones. Finally, the last phase uses the n-gram model to spawn new words when words are completed and a cross-word transition is applied. In the next section, we present a deep examination of the computational requirements of these phases.
Figure 3. Profiles of the 1K RM, 5K WSJ, and 60K HUB4 models

Completed words make cross-word transitions, triggering the n-gram language model, in which a list of word candidates is searched for word activation. The computation here is moderate, but the memory bandwidth demand is large and the memory access pattern is noncontiguous. The processing time depends largely on the number of active states and on the sizes of the language model word candidate lists searched each frame.

Figure 4. % of time spent on each stage of decoding running CMU Sphinx 3.0

We ran the three benchmarks on a 2.8 GHz Xeon processor with 2GB of memory, monitoring the decoding speed of each stage (Figure 4). The experiment revealed that the compute time for feature extraction constituted less than a percent of overall processing, while the time spent on acoustic scoring grew linearly with the total number of senones. Backend search dominated computation time at larger vocabulary sizes due to the high number of active states and language model word candidates. Based on this profiling, we set our sights on speeding up larger vocabulary tasks (at least 5K words), since smaller benchmarks already run easily in real-time on general purpose CPUs. We focus on the backend search stage because of its long processing time and its frame-to-frame computation variability. Backend search also stands to benefit significantly from a custom datapath and a memory hierarchy organized around its large, irregular memory access patterns. The other stages, while important, have clearer paths to high performance in custom hardware: feature extraction needs few computational resources, and acoustic scoring (GMM) is inherently parallelizable, as discussed in [10], so long as it has adequate memory bandwidth.
We did a more detailed study of Sphinx 3.0's backend search by examining the decoding speed and breaking the computation time down into sub-stages: Viterbi, Transition/Pruning, and Language model (Figure 5). The results show that the Language model and Viterbi stages consume the most time, and that for larger vocabulary sizes the time spent in the Language model increases proportionally. When designing the hardware architecture, we pay particular attention to speeding up these two stages.

Figure 5. % of time spent in each backend search stage

4. CUSTOM HARDWARE BACKEND SEARCH ENGINE
Our goal for our fast speech recognition backend search engine is to achieve a considerable speedup over software systems running on high end processors, while using fewer resources and running at a fraction of the clock speed. Our target platform is the BEE2 FPGA board, which contains 5 Xilinx Virtex-II Pro XC2VP70 FPGAs and 4 DDR2 DRAM channels per FPGA. In this section, we go over our approach to developing the architecture and introduce the custom datapath and memory system developed for the BEE2 platform. To evaluate speech recognition performance, we use decoding speed to measure speedup and word error rate (WER) to measure accuracy.
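For reference, these two metrics are conventionally defined as follows (standard definitions, not taken from the paper):

```latex
\text{speedup over real-time} \;=\; \frac{T_{\text{audio}}}{T_{\text{decode}}},
\qquad
\text{WER} \;=\; \frac{S + D + I}{N}
```

so "10 times faster than real-time" means one second of speech decodes in 0.1 s, and S, D, and I count the substituted, deleted, and inserted words measured against the N words of the reference transcript.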
4.1 Approach
Because software algorithms do not always map well to hardware, several modifications were made to the CMU Sphinx 3.0 algorithms to speed up our hardware implementation, as discussed in [10]; no further functional modifications have been made. To account for these changes, we generated a software backend search engine that matches our custom hardware design algorithmically, so that word accuracy and design correctness could be verified quickly and thoroughly. Afterwards, we extracted memory structures from the software recognizer to generate the DRAM image for our FPGA design. We then designed the baseline custom hardware architecture in Verilog HDL and used Mentor Graphics Modelsim to verify functional correctness and provide cycle count information. From there, we synthesized the design in Xilinx ISE/EDK and placed it on the BEE2 for validation. Once it was verified to be correct, we optimized for speed to fully utilize a single FPGA on the board. To further increase speedup, we split our design across two FPGAs on the board to gain access to more area. In doing so, we had to design a sophisticated interface to reduce communication overhead between the FPGAs while still maintaining correct functionality.
4.2 Architecture
Our architecture evolved from a straightforward serial implementation of the algorithms into an extensively parallel and pipelined design with multiple HMM entries (4-state HMMs) in flight at any given time. In this section we give an overview of the development of the high performance speech recognition backend search engine.

We began our pursuit of a high performance engine by creating a baseline datapath to lay the foundation upon which our optimizations would be built. The baseline datapath is broken up into 4 stages: Fetch, Viterbi, Transition/Writeback/Prune, and Language model. The roles of each stage are illustrated in Figure 6 and sketched in the code below.
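As a rough sketch of how one frame flows through these four stages, consider the Python skeleton below. The stage internals are stubs based on our reading of Sections 2.3 and 2.5; Figure 6 and the paper's text remain the authoritative description.

```python
# 4-stage baseline datapath skeleton; stage bodies are placeholders.
def fetch(active_list):
    """Fetch: stream the frame's active 4-state HMM entries from memory."""
    return list(active_list)

def viterbi(entry, senone_scores):
    """Viterbi: within-triphone state update per Equation 1."""
    return entry  # placeholder

def transition_writeback_prune(entries, beam):
    """Within-word transitions, beam pruning, write survivors back.
    Returns (surviving entries, words completed this frame)."""
    return entries, []  # placeholder

def language_model(word, ngram_lm):
    """Spawn cross-word successor entries from n-gram candidates."""
    return []  # placeholder

def decode_frame(active_list, senone_scores, ngram_lm, beam):
    entries = [viterbi(e, senone_scores) for e in fetch(active_list)]
    entries, completed = transition_writeback_prune(entries, beam)
    for word in completed:
        entries.extend(language_model(word, ngram_lm))
    return entries  # active list for the next frame
```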
One of the most challenging aspects of the backend search engine is how to partition its large data set effectively between BRAM and DRAM. Because different tasks have memory structures of widely varying sizes, we opted to store everything in DRAM unless the data structure (such as the senone scores) was guaranteed to fit completely in BRAM regardless of task size. Because memory plays such a critical role in the performance of this stage, we look to caches and buffers to speed up the design in our optimized implementation.
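The placement rule reduces to a one-line predicate, sketched below with an invented budget constant; the senone score table qualifies for BRAM because the senone count is fixed by the acoustic model rather than by the task's vocabulary size.

```python
def choose_memory(worst_case_bits: int, bram_budget_bits: int = 1 << 21) -> str:
    """Place a structure in BRAM only if its worst-case size is bounded
    and fits regardless of task size; otherwise it lives in DRAM.
    The default budget is illustrative, not the paper's figure."""
    return "BRAM" if worst_case_bits <= bram_budget_bits else "DRAM"

# e.g. a score table for ~8000 senones at 32 bits each (hypothetical):
# choose_memory(8000 * 32)  ->  "BRAM"
```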
After using our baseline datapath to estimate the area and timing requirements for our backend search engine on a BEE2 FPGA, we developed a performance driven design running on a single FPGA with several key optimizations: pipelining, parallelism, prefetching, and buffering.
Figure 6. Baseline backend search datapath
Prefetching: By pipelining read requests, the active HMM entries for a frame can be quickly serviced by the DRAM all at once.

Buffering: In our design we use buffering to reduce stalls created by memory and by busy modules. For example, in the Language model stage, n-gram word candidates from DRAM are continually fetched and buffered while Word Enter modules determine whether they should be stored into the Patch List. Because the word candidates are stored contiguously in memory, retrieving them and holding them in BRAM helps hide memory latency.

Viterbi cache: With the senone scores stored in BRAM and the map table data hitting often in the cache, the Viterbi stage typically does not require going to DRAM.

Language model cache: In the Language model, the words that make cross-word transitions do not change much over time; a cache holding all possible word candidates is therefore extremely beneficial, so long as it is large enough to cover the data set. The working set is rarely larger than the cache, so the module does a good job of reducing memory access times.
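Behaviorally, this cache amounts to a bounded map from word to its candidate list, as sketched below; the capacity and the eviction rule are our assumptions, since the text says only that the working set is small and changes slowly.

```python
class WordCandidateCache:
    """Sketch of the Language model cache: word -> n-gram candidates.
    Effective because the set of words making cross-word transitions
    changes little from frame to frame."""
    def __init__(self, capacity=4096):          # capacity is illustrative
        self.capacity = capacity
        self.entries = {}

    def lookup(self, word, fetch_from_dram):
        if word in self.entries:                # hit: no DRAM access needed
            return self.entries[word]
        candidates = fetch_from_dram(word)      # miss: fall back to DRAM
        if len(self.entries) >= self.capacity:  # simple eviction policy
            self.entries.pop(next(iter(self.entries)))
        self.entries[word] = candidates
        return candidates
```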
The complete optimized single FPGA design is shown in Figure 8, displaying the parallelization and pipelining of each stage. In the diagram, Fetch 0 and Transition/Prune/Writeback 0 take care of words starting with 'A' through 'L', and the other partition takes care of words starting with 'M' through 'Z'.

While our single FPGA implementation achieves a considerable speedup, further optimizations can be made to increase performance with more area. With the flexibility of the BEE2, we have the chance to create an even higher performance design by expanding our datapath across two FPGAs.
We decided the best way to achieve the highest performance was to replicate the single FPGA design onto the second FPGA, such that one FPGA is responsible for creating, updating, and pruning the HMMs for a subset of the vocabulary while the other is responsible for the rest of the words. We use one FPGA as the master, which dictates when events should occur, and the other as a slave, which sends and receives data according to the master FPGA's commands (Figure 9). Because board level communication is slow and restricted in bitwidth, we designed the system with as little communication across the board as possible while still achieving high performance.
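The paper does not state how the vocabulary is divided between master and slave; by analogy with the single FPGA design's 'A'-'L' / 'M'-'Z' partitions, a first-letter split might look like this sketch:

```python
def split_vocabulary(words, boundary="L"):
    """Hypothetical split of the vocabulary between the two FPGAs,
    mirroring the first-letter partition of the single FPGA design."""
    master = [w for w in words if w[0].upper() <= boundary]
    slave = [w for w in words if w[0].upper() > boundary]
    return master, slave

# split_vocabulary(["ABLE", "MONEY", "ZERO"]) -> (["ABLE"], ["MONEY", "ZERO"])
```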
Much like the datapath, the memory system is almost an exact replica of the single FPGA memory; the complete design uses twice the BRAM and two DRAM channels.

5. BACKEND SEARCH ENGINE IMPLEMENTATION
We now discuss several detailed system implementation issues concerning how we mapped our architecture onto the BEE2 platform.

5.1 Implementation Framework
The BEE2 platform has large FPGA resources and can support high memory bandwidth. We use it as a proof of concept to demonstrate our high performance architecture.

The BEE2 platform has 5 FPGAs: 1 control and 4 user. The control FPGA has direct access to peripherals such as Ethernet, RS232, and CompactFlash, and direct links to all user FPGAs. Each user FPGA has direct links to 2 neighboring user FPGAs as well as to the control FPGA. In our setup, we load the FPGA bitstreams directly from CompactFlash. One of the PowerPC cores on the control FPGA is used to support pre-decoding memory transfers (loading models into DRAM) and I/O (displaying data through RS232). While the processor supports the backend search engine, the engine itself is implemented entirely in custom hardware.

For our single FPGA implementation, we set a target clock rate of 100MHz. We designed the backend search engine to run on the control FPGA to avoid board level communication. Our testing method consists of loading the senone scores for a frame of speech, starting the backend search engine, and waiting until the stage completes before loading the scores for the next frame, until all frames are completed. To verify the functionality of each frame of speech, we set up an interface from the PowerPC to read from DRAM and compare the scores for each active HMM entry. As a sanity check, we also keep track of the number of active HMM entries written for each frame. Once all the frames have been decoded, the word sequence is displayed through RS232.

We set the dual FPGA implementation to also run at 100MHz. Unlike the single FPGA system, we placed our design on two user FPGAs, in anticipation of the design being used in a complete speech recognition system. With one of the control FPGA's PowerPC cores, we load the user FPGAs' bitstreams through SelectMAP (a low bandwidth communication channel). For testing, we use the control FPGA to load senone scores and start the decoding of a speech frame on both user FPGAs. Once all the frames are completed, the control FPGA requests the word sequence from the master user FPGA and displays it through RS232. We also created an interface through the control FPGA that allows the PowerPC to read DRAM data from both user FPGAs, to verify that the active HMM entries are being decoded correctly.

5.2 DRAM
On the BEE2, each FPGA is connected to 4 DDR2 SDRAM channels with 64-bit wide data. More telling is the speed at which we can retrieve data: each DRAM channel has a maximum bandwidth of 25.6 Gb/s, or 3.2GB/s (the DDR2 controller runs at 200 MHz). However, to achieve such data rates, the data must be stored at consecutive addresses in DRAM. Since we planned to run our design at 100 MHz, we used the BEE2 library's asynchronous DRAM interface, which allows our design to work with the faster DRAM controller. The asynchronous interface can transfer at most 256 bits of data each cycle to fully utilize the channel, and it pipelines requests, so there can be multiple outstanding requests. For randomly accessed data, maximally pipelined reads are serviced roughly once every 5 100MHz cycles (for a total of 640MB/s). The latency for a single read request is 18-19 cycles.
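As a consistency check we add here (the arithmetic is ours, not the paper's), the quoted rates follow directly from the interface parameters:

```latex
\underbrace{64\ \text{bits} \times 2\ (\text{DDR}) \times 200\ \text{MHz}}_{\text{peak, sequential}} = 25.6\ \text{Gb/s} = 3.2\ \text{GB/s},
\qquad
\underbrace{\frac{256\ \text{bits}}{5 \times 10\ \text{ns}}}_{\text{pipelined random reads}} = \frac{32\ \text{B}}{50\ \text{ns}} = 640\ \text{MB/s}
```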
Before decoding begins, the recognizer requires an initialization period to move static model data (all the word information for the HMMs and the probabilities for the n-gram word candidates) from the Compact Flash card to the DRAM of each FPGA (Figure 10). The data is transferred using a PowerPC core, through our recognizer, to DRAM. In the dual FPGA implementation, the data is transferred from the control FPGA to the user FPGAs through board level communication into the user FPGAs' DRAM. Once the data is transmitted, control is passed to the user FPGAs to access their own DRAM.

Figure 10. Accessing DRAM: 1) Compact Flash loads data into DRAM through the PowerPC and our backend search core; 2) backend search DRAM access path, with no use of the PLB

5.3 Board-level Communication
To communicate between FPGAs, we used inter-chip links, which use a 200MHz clock to transfer data with no support from the PowerPC. For our dual FPGA design, we use the inter-chip interface provided in the BEE2 library, which uses 138 pins, 69 bits each way, between user FPGAs. To load static data and do testing, we use the control FPGA link, which uses 64 pins, 32 bits each way. The inter-chip links allow pipelined data transfers, so the maximum bandwidth between user FPGAs is 3.5GB/s, and between the control FPGA and a user FPGA it is 800MB/s both ways. We were more concerned about latency, so we made sure it would not be a limiting factor by testing the inter-chip modules; we determined the latency to be 5 cycles of the 200MHz clock from one FPGA to the other. To bridge the different clock domains, we used an asynchronous FIFO to pass data.

6. EXPERIMENTAL RESULTS
In this section we discuss our simulation results and the area and memory requirements of our single and dual FPGA implementations of a fast speech recognition backend search engine.

6.1 Hardware Simulator Results
Before moving to the BEE2, we simulated our design thoroughly across the 5K WSJ task (330 utterances, for a total of 40 minutes of speech) in Modelsim to estimate decoding speed and to decide the optimal number of BRAMs to use for caches and buffers. We also examined the efficiency of our datapath and the DRAM utilization.
With BRAM limited in both size and configuration, we analyzed how the sizes of the buffers and caches affect the performance of our single FPGA design. Each entry in the Active HMM Queue Buffer is 256 bits, requiring a baseline cost of 8 BRAMs. For depths up to 512, the number of BRAMs needed remains the same. While a deeper buffer can provide better performance, the speedup diminishes with buffer depth and stops altogether once all the active HMM entries fit in the buffer. Through simulation we found that the ideal size is 8 BRAMs.
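The baseline cost follows from the Virtex-II Pro block RAM geometry (18 Kb blocks, configurable as 512 x 36), a reconstruction on our part rather than something spelled out in the text:

```latex
N_{\text{BRAM}} \;=\; \left\lceil \frac{256\ \text{bits}}{36\ \text{bits/block}} \right\rceil \;=\; 8
```

and since each block already provides 512 entries of depth in its 512 x 36 configuration, buffer depths up to 512 incur no additional BRAM cost.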
Profiling our design, we examined the efficiency of the pipeline by looking at how often one stage waits for the next (Figure 12). The Fetch stage waits the longest for Viterbi to become free, since 1 Viterbi unit serves 2 Fetch units. The Viterbi stage waits on the Transition/Prune/Writeback stage when a significant number of active HMM entries need to be written back to DRAM. The Transition/Prune/Writeback stage does not have to wait long for the Language model stage, because the completed word buffer absorbs the stalls. We note that while another Viterbi unit would provide better performance, the area of one of the FPGAs was not large enough to support it and still meet the 100MHz clock constraint.

The DRAM bandwidth requirement for our design is ~93MB/s for real-time processing, with ~66% reads and ~34% writes. Our design, which runs 6 times faster than real-time, accordingly consumes 558MB/s. We examined whether there is a memory bottleneck in the system by reducing the DRAM latency to 1 cycle and by increasing the number of DRAM channels to 2. In simulation, decreasing the latency cut the decoding time by 15%, and adding the second DRAM channel cut it by 6%.

The complete dual FPGA design can decode at about 10 times faster than real-time, or 28 million states per second. Compared against software recognizers, the performance improvement is 6-17 fold, with our design running at 100 MHz.
With its faster implementation, the dual FPGA design spends more time stalling per decoding cycle than the single FPGA design, although the pipeline stalls between stages remain fairly consistent (Figure 12). The discrepancy between the stall cycles of the two partitions of the dual FPGA design is most likely due to the decoding pattern of the words each partition is responsible for.

The DRAM bandwidth requirement is ~50MB/s for the first channel and 45MB/s for the second to achieve real-time processing, with ~66% reads and ~34% writes. Running 10 times faster than real-time, the design consumes 950MB/s. In simulation, when we reduced the DRAM latency to 1 cycle, the decoding time dropped by 17%.
6.2 Fast Backend Search Engines on BEE2
Once we had verified our system functionally in Modelsim, we synthesized our Verilog design in ISE 9.1i, extracted a gate-level design, and then did a final round of simulation. When we were confident of correctness, we placed the designs onto the BEE2 board and ran the complete benchmark. The single FPGA design (including the DRAM controller, asynchronous interface, Compact Flash, and other peripherals) uses 78% of the overall FPGA slices (25973/33088), 73% of the overall 5.8 Mb of BRAM (241/328), and 88MB of DRAM. The breakdown by module, derived from Xilinx ISE, can be seen in Table 2. The design runs successfully at 6 times faster than real-time with a 100 MHz clock.

Table 2. Resource utilization breakdown by module for the single FPGA design

Module                       Slices (% of total)   4-input LUTs (% of total)   # of BRAMs
Fetch                        4831 (15%)            5347 (8%)                   38
Viterbi                      927 (3%)              1674 (2%)                   54
Trans/Prune/Wb               3048 (9%)             5030 (8%)                   2
Language Model               9455 (29%)            13790 (21%)                 84
DRAM                         2655 (8%)             2987 (4%)                   14
Other (PPC, CF, RS232, ...)  5057 (15%)            9403 (14%)                  49
The dual FPGA implementation uses 2 FPGAs and 2 DRAM channels. The master FPGA uses 85% of the overall slices, 69% of the overall BRAM, and 88MB of DRAM. The slave FPGA uses 75% of the overall slices, 68% of the BRAM, and 88MB of DRAM. The breakdown by module, derived from Xilinx ISE, can be seen in Table 3. The design runs successfully at 10 times faster than real-time with a 100 MHz clock.

Table 3. Resource utilization breakdown by module for our dual FPGA design

Module             Slices (% of total)   4-input LUTs (% of total)   # of BRAMs
Master FPGA
  Fetch            4831 (15%)            5347 (8%)                   38
  Viterbi          927 (3%)              1674 (2%)                   54
  Trans/Prune/Wb   3048 (9%)             5030 (8%)                   2
  Language Model   10379 (31%)           16327 (25%)                 84
  DRAM             2655 (8%)             2987 (4%)                   14
  Other            6288 (19%)            10325 (16%)                 32
Slave FPGA
  Fetch            4831 (15%)            5347 (8%)                   38
  Viterbi          927 (3%)              1674 (2%)                   54
  Trans/Prune/Wb   3048 (9%)             5030 (8%)                   2
  Language Model   8604 (26%)            12954 (20%)                 84
  DRAM             2655 (8%)             2987 (4%)                   14
  Other            4753 (15%)            7405 (16%)                  31
7. CONCLUSION
A growing set of applications in the domain of audio mining requires extremely high-speed speech recognition. Our central argument in this paper is that software will always be too slow to service this important emerging area, and that hardware is always faster. In support of this assertion, we demonstrated the critical backend search component of a modern recognizer running at 10x real-time on a multi-FPGA BEE2 platform, handling a large 5000 word Wall Street Journal benchmark without loss of accuracy. Our fastest design is 17x faster than the software reference recognizer running on a high-end server, while running at a clock speed roughly 30x slower. To our knowledge, this design is the most sophisticated and fastest speech recognition backend search engine ever realized in hardware.

8. ACKNOWLEDGEMENTS
Special thanks to Tsuhan Chen, Richard M. Stern, and Arthur Chan of CMU for their ongoing support of the project. This research was supported by the Semiconductor Research Corp., the National Science Foundation, and the Focus Center for Circuit & System Solutions (C2S2), one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.

9. REFERENCES
[1] X. Huang, A. Acero, and H.-W. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall PTR, New Jersey, 2001.
[2] N. Leavitt, "Let's Hear It for Audio Mining", Computer, 35(10):23-25, Oct 2002.
[3] A. Stolzle et al., "Integrated Circuits for a Real-Time Large-Vocabulary Continuous Speech Recognition System", IEEE Journal of Solid-State Circuits, vol. 26, no. 1, pp 2-11, Jan 1991.
[4] R. Kavaler et al., "A Dynamic Time Warp Integrated Circuit for a 1000-Word Recognition System", IEEE Journal of Solid-State Circuits, vol. SC-22, no. 1, pp 3-14, Feb 1987.
[5] L. Cali, F. Lertora, M. Besana, and M. Borgatti, "Co-Design Method Enables Speech Recognition SoC", EETimes, Nov 2001, p. 12.
[6] B. Mathew, A. Davis, and Z. Fang, "A Low-power Accelerator for the SPHINX 3 Speech Recognition System", International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp 210-219, ACM Press, 2003.
[7] R. Krishna, S. Mahlke, and T. Austin, "Architectural optimizations for low-power, real-time speech recognition", International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp 220-231, 2003.
[8] S. Nedevschi, R. Patra, and E. Brewer, "Hardware Speech Recognition on Low-Cost and Low-Power Devices", Proc. Design Automation Conference, 2005.
[9] "The Talking Cure", The Economist, Mar 12 2005, p. 11.
[10] E. Lin, K. Yu, R. Rutenbar, and T. Chen, "A 1000-Word Vocabulary, Speaker-Independent, Continuous Live-Mode Speech Recognizer Implemented in a Single FPGA", International Symposium on Field-Programmable Gate Arrays (FPGA), Feb 2007.
[11] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A High-End Reconfigurable Computing System", IEEE Design and Test of Computers, vol. 22, no. 2, pp 114-125, Mar/Apr 2005.
[12] CMU Sphinx Open Source Speech Recognition Engines, http://cmusphinx.sourceforge.net/html/cmusphinx.php.
[13] D. Pallett, "A Look at NIST's Benchmark ASR Tests: Past, Present, and Future", Proc. 2003 IEEE Workshop on Automatic Speech Recognition and Understanding.
[14] M. Hwang et al., "Subphonetic Modeling with Markov States - Senone", International Conference on Acoustics, Speech and Signal Processing, pp 33-36, Mar 1992.
[15] X. D. Huang, Y. Ariki, and M. Jack, "Hidden Markov Models for Speech Recognition", Edinburgh University Press, 1990.
[16] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm", IEEE Transactions on Information Theory, vol. 13, pp 260-269, 1967.