
Distributed Speech Recognition for Information Retrieval on Mobile Devices


Tom Brøndsted, Lars Bo Larsen, Børge Lindberg, Morten Rasmussen,
Zheng-Hua Tan, Haitian Xu
Speech and Multimedia Communication (SMC), Department of Communication Technology
Aalborg University, Denmark

{tb,lbl,bli,mr,zt,hx}@kom.aau.dk

ABSTRACT
This paper presents a distributed speech recognition (DSR) system for information retrieval on mobile devices. The overall prototype system applies state-of-the-art DSR techniques and knowledge-based Information Retrieval (IR) processing for spoken question answering. A configurable DSR system is implemented on the basis of the ETSI-DSR advanced front-end and the SPHINX IV recognizer. Acoustic modeling for the Danish language and language modeling for a soccer test domain are presented in detail. Although the prototype system can only answer queries and questions within the chosen domain, it is designed to be portable to other domains.

Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Microprocessor/microcomputer applications; Real-time and embedded systems; Signal processing systems


Keywords
Distributed speech recognition, intelligent search engines, language models.


1. INTRODUCTION
Distributed Speech Recognition (DSR) has a wide range of applications because of its advantages both in reducing the computational requirements and power consumption of devices at the client side and in facilitating effortless updates of the core part of the recognizer at the server side [1]. The current paper applies DSR to accessing IR services on remote servers from mobile devices such as Personal Digital Assistants (PDAs) and mobile phones. In collaboration with the Software Intelligence and Security Research Centre (SISRC), Esbjerg, Denmark, a prototype system [2] has been built employing two main components: 1) an IR system using sophisticated IR techniques with a specialized question answering engine, and 2) a DSR system [3] implemented on the basis of the ETSI-DSR advanced front-end [4] and the SPHINX IV recognizer [5].

The IR engine has been designed to perform better than conventional search engines based on e.g. the vector-space model (i.e. it returns fewer and more relevant web documents), provided that the documents to be retrieved are written in a certain language (in our case Danish) and belong to a certain domain. For the test application, the domain of soccer has been chosen [2]. Similarly, the DSR system is of course language-dependent (it employs acoustic models trained on Danish speech), and the acoustic search is currently constrained by a grammar designed explicitly for the domain. Alternative language modeling has been analyzed by comparing various text corpora.

Other systems with a similar aim include [13], which is built from commercial components, and a system for retrieval of Chinese (Mandarin) broadcast news [14], [15]. The SmartWeb project [16], [17] is an effort to access semantic web information using spoken queries.

This paper focuses on the second component of the overall system, the DSR system, and is organized as follows. Section 2 presents the system architecture, Section 3 describes the speech recognizer, and Section 4 presents conclusions.

2. SYSTEM ARCHITECTURE
The system employs a fully distributed architecture that includes both the speech recognizer and the IR system. The overall architecture of the system is depicted in Fig. 1.

Figure 1: Architecture of the speech-enabled information retrieval system.
The speech recognizer is implemented using the DSR scheme, where the speech recognition processing is split into client-based DSR front-end feature extraction and server-based DSR back-end recognition on a standard PC [1]. Speech features are transmitted from the DSR client to the DSR server. Subsequently, the output, in the form of either the best result or an N-best list, is sent from the remote server back to the client and from there to the IR server. To enable configuration of the DSR back-end recognizer, commands are transmitted back and forth between the DSR client and the DSR server. The distributed speech recognizer is described in Section 3.
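As a rough illustration of this client-server split, the sketch below shows a client-side loop that extracts features frame by frame and streams them to the server. It is a simplification: the helper methods standing in for the AFE feature extraction are hypothetical, and a plain TCP stream replaces the standardized ETSI payload format.

import java.io.DataOutputStream;
import java.net.Socket;

/** Illustrative DSR client loop: extract features per frame and stream them
 *  to the recognition server. The real system uses the ETSI AFE payload
 *  format; readAudioFrame(), extractMfccFrame() and vadDecision() are
 *  hypothetical stand-ins for the AFE client-side implementation. */
public class DsrClientSketch {
    static final int FRAME_SHIFT_SAMPLES = 80; // 10 ms at 8 kHz
    static final int NUM_CEPSTRA = 13;         // MFCC features per frame

    public static void main(String[] args) throws Exception {
        try (Socket server = new Socket("dsr.example.org", 5000); // placeholder host
             DataOutputStream out = new DataOutputStream(server.getOutputStream())) {
            short[] frame = new short[FRAME_SHIFT_SAMPLES];
            while (readAudioFrame(frame)) {             // capture from microphone
                float[] mfcc = extractMfccFrame(frame); // noise-robust front-end
                boolean speech = vadDecision(frame);    // voice activity detection
                out.writeBoolean(speech);               // pack and transmit
                for (float c : mfcc) out.writeFloat(c);
            }
        }
    }

    // Hypothetical helpers standing in for the AFE implementation:
    static boolean readAudioFrame(short[] buf) { return false; }
    static float[] extractMfccFrame(short[] buf) { return new float[NUM_CEPSTRA]; }
    static boolean vadDecision(short[] buf) { return false; }
}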
The IR server is described in more detail in [2]. It receives the user's questions about the chosen soccer test domain in text form and determines whether a question should be sent to the search engine or can be answered directly using domain-dependent Information Extraction (IE)-like techniques [6]. The server continuously updates its knowledge base by retrieving sports news via the WWW in the form of RSS articles from diverse news providers [7]. These documents are stored in the database for retrieval.
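As an illustration of this knowledge-base update step, the following sketch polls a single RSS feed and extracts item titles for indexing; the feed URL is a placeholder, and the real system aggregates several providers and stores full articles:

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Illustrative RSS poller: fetches one feed and lists item titles. */
public class RssPollerSketch {
    public static void main(String[] args) throws Exception {
        URL feed = new URL("https://news.example.org/soccer/rss"); // hypothetical provider
        try (InputStream in = feed.openStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList titles = doc.getElementsByTagName("title");
            for (int i = 0; i < titles.getLength(); i++) {
                // Each article would be stored in the database for later retrieval.
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }
}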
Figure 2 shows the application user interface.


Figure 2: Screen shot of the soccer IR service. Input (top left field) can be entered either by voice or by stylus. Search results are presented as text and shown below.

3. DSR SYSTEM
This section describes the DSR system, acoustic modeling and
language modeling.

3.1 Distributed Architecture
As illustrated in Fig. 3, the DSR system [3] is developed on the basis of the ETSI-DSR advanced front-end (AFE) [4], [8] and the SPHINX IV recognizer [5]. The AFE client-side module extracts noise-robust Mel-Frequency Cepstral Coefficient (MFCC) features which, together with Voice Activity Detection (VAD) information, are encoded sequentially and packed into speech feature packages for network transmission.

At the server side, the received speech packages are processed by the AFE server-side module. First, on detection of transmission errors, error concealment is conducted for feature reconstruction. Then the error-corrected speech feature packages are decoded into a set of cepstral features and VAD information. Subsequently, the cepstral features are processed by the SPHINX IV speech recognizer. The recognizer produces its result (either the best or the N-best results) at the utterance end, detected from the VAD information, and transmits it back to the client. To increase system usability and flexibility, three typical recognition modes are supported: isolated word recognition, grammar-based recognition, and large vocabulary recognition. Each mode is defined by a set of prototype files at the server side, and the setting may differ across a group of end users.

Figure 3: The DSR architecture.
A Command Processor is implemented at both the client and server side to support the interchange of configuration commands. Potential commands include control commands to start or stop recognition, commands selecting the recognition mode, and commands providing feedback from the server to a client (e.g. success or failure of a user request).
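The paper does not specify the wire format of these commands. As a rough sketch under that caveat, such an exchange could be realized as a simple line-based protocol (all command names below are invented for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

/** Illustrative client-side Command Processor exchange. The actual command
 *  names and wire format of the system are not published; these are invented. */
public class CommandProcessorSketch {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("dsr.example.org", 5001); // placeholder host/port
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println("SET_MODE GRAMMAR");   // choose one of the three modes
            if (!"OK".equals(in.readLine())) { // server feedback: success/failure
                throw new IllegalStateException("mode rejected");
            }
            out.println("START_RECOGNITION");  // begin streaming features
            // ... feature transmission happens here ...
            out.println("STOP_RECOGNITION");
            System.out.println("Result: " + in.readLine()); // best or N-best result
        }
    }
}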
The DSR client has been evaluated on an H5550 iPAQ with a 400 MHz XScale CPU and 128 MB of memory. With speech data sampled at 8 kHz, the client is able to conduct the speech processing in 0.82 times real time, i.e. processing one second of audio takes 0.82 seconds.

3.2 Acoustic modeling


The acoustic models used in the system are created as part of an ongoing effort to build a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Danish on the basis of the SPHINX IV speech recognition platform. The speech data set consists of 44 hours of recorded sentences selected from the SpeechDat(II) speech database [9], which contains telephone-quality speech sampled at 8 kHz and stored in A-law format. The models are trained using SphinxTrain and the SPHINX IV front-end.

The trained models are word-internal and between-word triphones based on 61 phones (including diphthongs). The triphones are modeled by three-state left-to-right continuous density Hidden Markov Models (HMMs) with 16 Gaussians per state. They are trained by building binary decision trees that cluster similar triphone states in order to reduce the number of parameters to be trained; the resulting total number of clusters (tied states) is approximately 5000. This improves the training of the triphones given the limited amount of training data.

The Language Model (LM) used is a statistical trigram model trained on the Korpus2000 corpus (20M words) [10] and the SpeechDat(II) text corpus (160k words) using Good-Turing discounting. One LM is first trained on each corpus, and the generated LMs are then mixed so as to minimize the perplexity on a development text set [11]. Both Korpus2000 and SpeechDat(II) are very general corpora containing texts of various styles and topics; Korpus2000 is described in more detail in Section 3.4.

The above setting results in a baseline system with a word error rate (WER) of 37.6% when tested on 16483 words (1519 sentences) from a general domain. Towards improving the performance step by step, a number of experiments have been made [18]:
1) Add word phrases, which are essentially sentences, to the training material, resulting in 48 hours of speech for training.
2) Train models for filled pauses (a speaker saying "umm", etc.) and speaker noise (coughs, etc.), and include training material containing intermediate noise (door slams, etc.) that was previously excluded, resulting in 62 hours of training material. These context-independent models are likewise three-state left-to-right continuous density HMMs with 16 Gaussians per state.
3) Force-align the transcriptions in order to select alternative pronunciations of words and to insert silence and speaker-noise markers into the transcriptions used for training.
4) Increase the bandwidth used when creating the cepstral coefficients from 300-3400 Hz to 150-3600 Hz.
5) Create gender-dependent models using only 3000 clustered states.
The WER for each step is shown in Table 1.

Table 1: The WER at each step of improvement.

Methods                     WER
Baseline system             37.6 %
Training material           37.4 %
Filler models               35.2 %
Force alignment             34.7 %
Increase bandwidth          33.0 %
Gender dependent models     31.9 %
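The WER figures above follow the standard definition, WER = (substitutions + deletions + insertions) / number of reference words. For reference, a minimal sketch of the underlying edit-distance computation (not the scoring tool actually used, which the paper does not specify):

/** Standard word error rate via Levenshtein alignment between a reference
 *  and a hypothesis word sequence. */
public class WerSketch {
    static double wer(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }
}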

When recognizing 9260 words (843 sentences) from a general domain using the female models, the WER is 30.5%. Similarly, the WER when recognizing 7223 words (676 sentences) using the male models is 33.8%. These tests run at 0.7 times real time on a P4 2.8 GHz computer.
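The LM interpolation described in this section (one LM per corpus, mixed to minimize perplexity on a development set [11]) can be sketched as follows; the LanguageModel interface and the grid-search resolution are assumptions for illustration, not the actual tooling used:

/** Illustrative linear interpolation of two language models: the mixture
 *  weight lambda is chosen by grid search to minimize development-set
 *  perplexity. P_mix(w|h) = lambda * P_a(w|h) + (1 - lambda) * P_b(w|h). */
public class LmMixtureSketch {
    interface LanguageModel { double prob(String word, String history); }

    static double mixProb(LanguageModel a, LanguageModel b, double lambda,
                          String word, String history) {
        return lambda * a.prob(word, history) + (1 - lambda) * b.prob(word, history);
    }

    /** Perplexity of the mixture on a dev text given as (word, history) pairs. */
    static double perplexity(LanguageModel a, LanguageModel b, double lambda,
                             String[][] devSet) {
        double logProbSum = 0.0;
        for (String[] pair : devSet) {
            logProbSum += Math.log(mixProb(a, b, lambda, pair[0], pair[1]));
        }
        return Math.exp(-logProbSum / devSet.length);
    }

    static double bestLambda(LanguageModel a, LanguageModel b, String[][] devSet) {
        double best = 0.0, bestPpl = Double.MAX_VALUE;
        for (double lambda = 0.05; lambda < 1.0; lambda += 0.05) { // coarse grid
            double ppl = perplexity(a, b, lambda, devSet);
            if (ppl < bestPpl) { bestPpl = ppl; best = lambda; }
        }
        return best;
    }
}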

3.3 Vocabulary and grammar


The target domain for the recognition task is Danish soccer news. The current version of the system employs a rule grammar for modeling user questions. The rule grammar requires rather fixed sentence structures and covers only a subset of the questions that can actually be handled by the IR/IE component of the system. The vocabulary size is around 700 words, including slightly more than 400 player and team names. As player and team names are handled as two syntactic classes (implying that all players and all teams are syntactically equivalent and equally likely), the perplexity of the grammar is high in spite of the small vocabulary.
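The grammar itself is not reproduced in the paper. Purely to illustrate the class-based treatment of player and team names, a hypothetical, heavily simplified JSGF-style fragment (all phrasings and names are invented) might look as follows, embedded here as a Java string since SPHINX IV accepts JSGF rule grammars:

/** Hypothetical, heavily simplified JSGF-style rule grammar illustrating the
 *  class-based handling of player and team names. This is NOT the grammar
 *  used in the system; rules and names are invented for illustration. */
public class SoccerGrammarSketch {
    static final String JSGF =
        "#JSGF V1.0;\n" +
        "grammar soccer;\n" +
        "public <query> = hvem scorede for <team>\n" +
        "               | hvornaar spiller <team> mod <team>\n" +
        "               | hvilket hold spiller <player> for;\n" +
        "<team>   = broendby | fc koebenhavn | aab;\n" +
        "<player> = jon dahl tomasson | thomas helveg;\n";
}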
Applying the rule grammar described above for recognition, a sentence accuracy of 90% is achieved when testing on 260 sentences (approximately 14 minutes) recorded by one person using a Plantronics M2500 Bluetooth headset, with the advanced DSR front-end [8] used for both training and test. It should be noted, however, that it is also possible to use the embedded microphone of the PDA as the input device.

3.4 Alternative language modeling


Alternative linguistic search-space constraints, based either on a more elaborate rule grammar or on conventional statistical language modeling, are under consideration. For this purpose, the documents making up the collection of the IR search engine (labeled DB in Fig. 1) have been analyzed to clarify the extent to which they exhibit the characteristics of a distinct sublanguage (relatively closed vocabulary, etc.). Although there is no simple 1:1 relation between these documents and the input the system expects from the user, certain statistics (unigram, bigram, and trigram counts) can be considered largely transferable, since the documents have been collected with the expectation that they contain answers to the questions users ask.

In IE, the term sublanguage [12] denotes a language restricted to a certain semantic (typically "technical") domain and characterized by absence of lexical ambiguity, a relatively closed vocabulary, and more. Traditionally, the language encountered in weather reports is considered a distinct sublanguage. Fig. 4 depicts the number of unique words (unigrams) observed as a function of the running text, i.e. the total number of words encountered, for three text corpora: 1) sentences from the Danish Korpus2000 corpus, 2) the documents making up the collection of the IR/IE component of the current system, and 3) 48 daily weather reports sampled February-March 2003.

Figure 4: Number of unique words (unigrams) as a function of the total number of words in 1) Korpus2000, 2) the collection of sports news, and 3) weather reports.

Korpus2000 consists of sentences in random order taken from various sources such as newspapers, publishers, schools, and various organizations [10]. As seen in Fig. 4, Korpus2000 shows no flattening tendency in the number of unique words observed over the first 60000 words of the corpus (which corresponds to the amount of text currently available for the sports domain). In contrast, the weather reports very quickly (after approximately 5000 words, not easily seen in the figure) show a flattening tendency, as expected for a distinct sublanguage. The wording in such reports is stereotyped ("light to moderate", "moderate to fresh", "south to south-east", "east to north-east", etc.), and even the corresponding curve of unique bigrams as a function of the total number of words shows a slight flattening tendency after about 35000 words. Using Korpus2000 and the weather reports as references, we can place the sports news somewhere between a distinct sublanguage and heterogeneous, general Danish language usage, unfortunately closer to the latter than to the former.

This tendency is even more pronounced when we compute bigram statistics for the three corpora (Fig. 5), analogous to the unigram statistics in Fig. 4.

Figure 5: Bigram statistics analogous to the unigram statistics in Fig. 4.
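Growth curves like those in Figs. 4 and 5 can be reproduced with a simple counting pass over a corpus; the sketch below (the corpus file name is a placeholder) emits one data point per 1000 running words:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Counts unique unigrams and bigrams as a function of running text length. */
public class NgramGrowth {
    public static void main(String[] args) throws IOException {
        // Hypothetical corpus file containing whitespace-separated running text.
        String text = new String(Files.readAllBytes(Paths.get("sports_news.txt")));
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> unigrams = new HashSet<>();
        Set<String> bigrams = new HashSet<>();
        for (int i = 0; i < tokens.length; i++) {
            unigrams.add(tokens[i]);
            if (i > 0) bigrams.add(tokens[i - 1] + " " + tokens[i]);
            // Emit one point per 1000 running words, as plotted in Figs. 4 and 5.
            if ((i + 1) % 1000 == 0) {
                System.out.printf("%d\t%d\t%d%n", i + 1, unigrams.size(), bigrams.size());
            }
        }
    }
}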
We conclude that we have far from sufficient sports news data for training a bigram or trigram LM for the domain, and tests indicate that the problem cannot be solved even by heavy smoothing techniques (e.g. Good-Turing) or by adaptation of LMs trained on larger corpora (like Korpus2000). In any case, we have to compensate for the absence in the data of the syntactic patterns found in questions (yes/no- and wh-questions). Furthermore, a large number of foreign names are encountered in this domain, and these must be dealt with in some way, since new players are emerging all the time. The problem with foreign names is that they are hard to transcribe automatically and that people might not know how to pronounce them properly.

An elaborate rule grammar, manually or semi-automatically designed to accept sentences from the sports data as well as questions derived from these sentences, has so far been ruled out, as it would unquestionably result in too high a perplexity and too large a search network to be used with a conventional linear Viterbi search.

4. CONCLUSIONS
In this paper we have presented a mobile information access system with spoken query answering on the basis of DSR technology. The system adopts a distributed architecture in which the speech recognizer and the knowledge-based IR system are located on different servers. The prototype system is capable of answering spoken queries posed via a PDA by using a rule grammar. The domain analysis also shows that far more sports news data are needed for training a statistical language model.

5. ACKNOWLEDGEMENTS
The project was supported by the Center for TeleInFrastruktur (CTIF) in the project POSH under CTIF's C3 program. Furthermore, the authors thank our colleagues Henrik Legind Larsen, Daniel Ortiz-Arroyo and Dan Saugstrup at the Software Intelligence and Security Research Centre (SISRC) for providing the IR server of the system.

6. REFERENCES
[1] Tan, Z.-H., Dalsgaard, P. and Lindberg, B.: Automatic speech recognition over error-prone wireless networks, Speech Communication, 47(1-2), 220-242, 2005.
[2] Brøndsted, T., Larsen, H.L., Larsen, L.B., Lindberg, B., Ortiz-Arroyo, D., Tan, Z.-H. and Xu, H.: Mobile information access with spoken query answering, COST278 Final Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE), Aalborg, Denmark, Nov. 2005.
[3] Xu, H., Tan, Z.-H., Dalsgaard, P., Mattethat, R. and Lindberg, B.: A configurable distributed speech recognition system, Biennial on DSP for in-Vehicle and Mobile Systems, Sesimbra, Portugal, Sep. 2005.
[4] ETSI Standard ES 202 212: Distributed speech recognition; extended advanced front-end feature extraction algorithm; compression algorithm, back-end speech reconstruction algorithm, November 2003.
[5] The CMU Sphinx Group Open Source Speech Recognition Engines. http://cmusphinx.sourceforge.net/html/cmusphinx.php
[6] Larsen, H.L. and Yager, R.R.: The use of fuzzy relational thesauri for classificatory problem solving in information retrieval and expert systems, IEEE Trans. on Systems, Man, and Cybernetics, 23(1):31-41, 1993.
[7] Mathiassen, H., Nielsen, N.N. and Pedersen, A.: Mining Tables from Domain Specific HTML Text, Information Retrieval Project Report, Aalborg University Esbjerg, 2005.
[8] 3GPP TS 26.243: ANSI-C code for the Fixed-Point Distributed Speech Recognition Extended Advanced Front-end, December 2004.
[9] Lindberg, B.: SpeechDat, Danish FDB 4000 speaker database for the fixed telephone network, pp. 198, March 1999.
[10] Korpus2000: www.korpus.dsl.dk, visited April 2006.
[11] Rasmussen, M.H. and Svendsen, M.T.: Large Vocabulary Continuous Speech Recognizer for Danish and Language Model Adaptation, Master Thesis, Aalborg University, 2005.
[12] Harris, Z.: Mathematical Structures of Language, Wiley-Interscience, New York, 1968.
[13] Schofield and Zheng: A speech interface for open-domain question-answering, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, July 2003, pp. 177-180.
[14] Chang, E., Seide, F., Meng, H.M., et al.: A system for spoken query information retrieval on mobile devices, IEEE Trans. Speech and Audio Processing, 10(8):531-541, 2002.
[15] Chen, B., Chen, Y.-T., Chang, C.-H., et al.: Speech retrieval of Mandarin broadcast news via mobile devices, Interspeech 2005, Lisbon, Portugal, Sep. 2005.
[16] Reithinger, N. and Sonntag, D.: An integration framework for a mobile multimodal dialogue system accessing the semantic web, Interspeech 2005, Lisbon, Portugal, Sep. 2005.
[17] SmartWeb: Mobile Broadband Access to the Semantic Web. http://www.smartweb-project.org
[18] Lindberg, B., Johansen, F.T., Warakagoda, N., et al.: A noise robust multilingual reference recogniser based on SpeechDat(II), ICSLP 2000, Oct. 2000.
