Google Inc.
hankliao@google.com, golan@google.com
First, transcriptions of utterances acoustically classified as likely from children (described in Section 3.1) were used as an unsupervised corpus (80M voice queries). Second, we suppressed web pages that were rated as relatively difficult to read using a reading-level classifier [29] inspired by the work of Collins-Thompson and Callan [30], and boosted data at the 2nd to 6th grade reading level. Third, search queries from search sessions that led to clicks on child-friendly web sites were added to our training data (3B typed queries). These web sites were identified by an algorithm based on user click behavior. Roughly, a small set of sites of no interest to children was chosen as negatively scored seeds, and a small set of sites of high interest to children was chosen as positively scored seeds. Web sites that were clicked in the same session as a seeded site received some of the seed's score. This was iterated through all sessions; in total, about 0.12% of web pages were deemed child friendly (3B sentences from web pages).
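The score propagation can be pictured with a minimal sketch. This is a hypothetical rendering of the idea only, not the production algorithm: the session representation, the transfer fraction, the iteration count, and the decision threshold are all invented for illustration.

    from collections import defaultdict

    def propagate_child_scores(sessions, pos_seeds, neg_seeds,
                               transfer=0.5, num_iters=10, threshold=0.5):
        # Seeds keep fixed scores; every other site starts at 0 and
        # repeatedly receives a fraction of the average score of the
        # sites co-clicked with it in a session.
        scores = defaultdict(float)
        for s in pos_seeds:
            scores[s] = 1.0
        for s in neg_seeds:
            scores[s] = -1.0

        for _ in range(num_iters):
            updates = defaultdict(float)
            for session in sessions:  # each session: sites clicked together
                total = sum(scores[s] for s in session)
                for site in session:
                    others = total - scores[site]
                    updates[site] += transfer * others / max(len(session) - 1, 1)
            for site, value in updates.items():
                if site not in pos_seeds and site not in neg_seeds:
                    scores[site] = value

        # Sites scoring above the threshold are deemed child friendly.
        return {site for site, v in scores.items() if v > threshold}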
Separate 5-gram language models were trained on each of the above sources and then combined using Bayesian interpolation [31] to yield a 100M n-gram model. The vocabulary is a list of 4.6M tokens extracted from written-domain text. As shown in Table 2, we were able to model our target child audience with lower perplexity while greatly reducing the chance of offensive queries being predicted, compared to our baseline Voice Search language model.

Table 2: Perplexity for textual Child dev sets.

    Data Set                 Baseline   Post-filtering
    Child-friendly Queries   412.7      209.6
    Offensive Queries        353.2      2896.6
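As a simplified stand-in for Bayesian interpolation [31], which additionally adapts the mixture weights per n-gram history, plain linear interpolation illustrates how the per-source models are combined. The source names, probabilities, and weights below are invented for illustration.

    import math

    # Per-source conditional probabilities P(w | h) for one n-gram,
    # e.g. from the 5-gram models trained on child transcripts,
    # filtered web pages, and child-friendly queries.
    source_probs = {"transcripts": 0.012, "web_pages": 0.004, "queries": 0.020}
    weights      = {"transcripts": 0.4,   "web_pages": 0.3,   "queries": 0.3}

    p = sum(weights[s] * source_probs[s] for s in source_probs)
    log_p = math.log(p)  # interpolated log-probability for this n-gram
    print(f"P = {p:.4f}, log P = {log_p:.2f}")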
4. Experimental Results

The experiments are conducted on proprietary, anonymized mobile search data sets described in Section 3.1. Language models are trained in a distributed manner [32]. All acoustic models are trained using asynchronous gradient descent [33], first with a cross-entropy criterion and then further sequence-trained with the sMBR criterion [34] until convergence, unless otherwise noted. Cross-entropy models typically converge after 25-50 epochs, and sequence training in fewer than 10. An exponentially decaying learning rate was used.
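For concreteness, an exponential decay schedule has the following shape; the constants are illustrative, since the paper does not give its settings.

    def learning_rate(step, base_lr=0.001, decay_steps=100_000, decay_rate=0.1):
        # Learning rate shrinks by decay_rate every decay_steps steps.
        return base_lr * decay_rate ** (step / decay_steps)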
4.1. Feature Compensation

We explored spectral smoothing, as described in Section 2, for children's speech recognition with an LSTM acoustic model. With smoothing applied only at test time, the word error rate more than doubled, from 15.0% to 36.6%. This is likely because the filterbank introduced a further mismatch between test and training data, so we applied the same filter widths to the training data. Utterances are aligned with the same alignment model and default filterbank, but three different feature sets were produced with different minimum filter widths of 150, 200 and 250 Hz. We found that a greater minimum filter width resulted in progressively worse training and test frame accuracy during cross-entropy training; the word error rate worsens from the 12.9% baseline to 15.7%. We suspect the smoothing destroys useful low-frequency information.
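A minimal sketch of the minimum-filter-width idea, assuming the filterbank is represented by per-filter edge frequencies; the representation and helper name are hypothetical, not the paper's implementation.

    import numpy as np

    def widen_mel_filters(edges_hz, min_width_hz=150.0):
        # edges_hz[i] holds (left, center, right) edge frequencies of
        # triangular filter i. Filters narrower than min_width_hz are
        # symmetrically widened about their center, smoothing the
        # spectrum by pooling energy over a wider band.
        widened = []
        for left, center, right in edges_hz:
            if right - left < min_width_hz:
                left = center - min_width_hz / 2.0
                right = center + min_width_hz / 2.0
            widened.append((max(left, 0.0), center, right))
        return np.array(widened)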
We also investigated the impact of VTLN on training sets consisting of a few thousand hours of child speech for LSTM acoustic models. For experimentation, we ignore real-time and latency constraints and use a 2-pass decoding strategy where the hypothesis from a first-pass decoding is used as supervision to obtain a maximum likelihood estimate of the warping factor per utterance, from 0.85 to 1.15. This is challenging because each utterance is short: on average 4.4 seconds long. First, an oracle experiment was conducted where the ground-truth reference transcript was used to estimate the optimal warp factor at test time. Under this scenario, we found that the WER would decrease from 14.0% to 13.5% on the Child test set. However, when the first-pass recognition hypothesis is used to estimate the warp factor, the WER increased to 14.2%. We found that a third of the utterances in the test set had a much higher error rate, around 27%; when the warp factor was estimated using these poor hypotheses, the results were worse than not using VTLN at all. On these difficult utterances, the quality of the first-pass supervision, combined with the short utterance duration, did not lead to a robust estimate of the warping factors. We conclude that VTLN was not effective for our application.
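The per-utterance warp estimation amounts to a grid search over candidate factors, scored against the supervision. In this sketch, frontend.warp and acoustic_model.score are hypothetical interfaces standing in for feature warping and forced-alignment log-likelihood scoring.

    import numpy as np

    # Candidate warp factors, matching the paper's 0.85-1.15 range.
    WARP_FACTORS = np.arange(0.85, 1.1501, 0.02)

    def estimate_warp(utterance, supervision, acoustic_model, frontend):
        # supervision: first-pass hypothesis (or the reference
        # transcript in the oracle experiment).
        best_warp, best_ll = 1.0, -np.inf
        for alpha in WARP_FACTORS:
            feats = frontend.warp(utterance, alpha)        # VTLN-warped features
            ll = acoustic_model.score(feats, supervision)  # alignment log-likelihood
            if ll > best_ll:
                best_warp, best_ll = alpha, ll
        return best_warp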
4.2. Combining Adult and Child Training Data

Although YouTube Kids is targeted at children, the ASR should still work well for the parents who are also likely to use it. An obvious solution is to use an additional model dedicated to adults, selected at decoding time by some classifier, e.g. the same one we used for data selection. The main drawback of this solution is the added recognition complexity. The approach we chose was to train one model that serves both adults and children. We found that adding adult utterances to the training set works well, and in fact improves children's speech recognition, as shown in Table 3.

Table 3: Word error rate (%) for LSTM models trained on different fractions of vs2014 added to hp2014.

    % of vs2014   # Utts   Adult   Child
    0             1.9M     37.2    11.2
    20            2.4M     14.2    10.3
    40            2.9M     13.8    10.2
    60            3.5M     13.6    10.1
    80            4.0M     13.4    10.0
    100           4.5M     13.4    10.2
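The pooling behind these experiments is straightforward; a minimal sketch, assuming in-memory utterance lists and hypothetical names, with adult_fraction mirroring the "% of vs2014" column of Table 3.

    import random

    def mix_training_sets(child_utts, adult_utts, adult_fraction):
        # Build one pooled training set from child (hp2014-like) and
        # adult (vs2014-like) utterances; a single model is then
        # trained on the shuffled pool.
        n = int(adult_fraction * len(adult_utts))
        pooled = child_utts + random.sample(adult_utts, n)
        random.shuffle(pooled)
        return pooled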
We also wanted to evaluate the accuracy improvement from a smaller amount of child data. To check this, we added portions of hp2014 to vs2014 and trained an LSTM model. We found that adding the entire child data set helps significantly more than adding a portion of it, as can be seen in Table 4. The second line of Table 4 shows the result for a model trained with 200k child and 2.6M adult utterances, which achieved a 0.2% improvement over a model trained only on adult speech.

Table 4: Word error rate (%) for LSTM models trained on different fractions of hp2014 added to vs2014.

    % of hp2014   # Utts   Adult   Child
    0             2.6M     13.4    11.9
    10            2.8M     13.6    11.7
    20            3.0M     13.7    11.5
    40            3.4M     13.7    11.1
    80            4.1M     13.7    10.7
    100           4.5M     13.4    10.2

Our results in combining vs2014 and hp2014 data for CLDNN models were similar; however, we found that using all the data, rather than some fraction of it, gave the best results for both the Adult and Child data sets. Table 5 summarizes the effect of adding the entire hp2014 data set to vs2014 on both the Adult and Child test sets. The CLDNN is clearly better than the LSTM model, by 6-10% relative depending on the test and training condition. The CLDNN model trained solely on vs2014 contains one additional LSTM layer, which was removed when training with vshp2014 because it increased training time without affecting performance. The LSTM models and the vs2014-trained CLDNN have a comparable number of parameters: 12.8M versus 13.1M. In [21], it was shown that larger LSTM models did not improve accuracy.
4.3. Results By Pitch

To further understand how the LSTM and CLDNN results compare with respect to adult and child speakers, we apply the YIN fundamental frequency algorithm [35] to estimate the average pitch […] this model was trained on data where 50% is high pitched. The gap that remains between the adult speaker error rates and child speech may indicate that further research is needed.
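Per-utterance average pitch can be estimated along these lines; librosa's YIN implementation and the 60-500 Hz search range are illustrative choices, not the paper's setup. Utterances can then be bucketed by this value to compare error rates for low-pitched (adult) and high-pitched (child) speech.

    import librosa
    import numpy as np

    def average_pitch(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)
        # Frame-level fundamental frequency estimates in Hz, via YIN [35].
        f0 = librosa.yin(y, fmin=60.0, fmax=500.0, sr=sr)
        # Unvoiced frames produce noisy estimates, so the median is a
        # more robust per-utterance summary than the mean.
        return float(np.median(f0))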
6. References

[1] M. Gerosa, D. Giuliani, S. Narayanan, and A. Potamianos, "A review of ASR technologies for children's speech," in Proc. Workshop on Child, Computer and Interaction (WOCCI), 2009.

[2] P. G. Shivakumar, A. Potamianos, S. Lee, and S. Narayanan, "Improving speech recognition for children using acoustic adaptation and pronunciation modeling," in Proc. Workshop on Child, Computer and Interaction (WOCCI), 2014.

[3] M. Russell and S. D'Arcy, "Challenges for computer recognition of children's speech," in Proc. Speech and Language Technologies in Education (SLaTE), 2007.

[4] A. Hämäläinen, S. Candeias, H. Cho, H. Meinedo, A. Abad, T. Pellegrini, M. Tjalve, I. Trancoso, and M. S. Dias, "Correlating ASR errors with developmental changes in speech production: A study of 3-10-year-old European Portuguese children's speech," in Proc. Workshop on Child, Computer and Interaction (WOCCI), 2014.

[5] A. Potamianos, S. Narayanan, and S. Lee, "Automatic speech recognition for children," in Proc. Eurospeech, 1997.

[6] S. S. Gray, D. Willett, J. Lu, J. Pinto, P. Maergner, and N. Bodenstab, "Child automatic speech recognition for US English: Child interaction with living-room-electronic-devices," in Proc. Workshop on Child, Computer and Interaction (WOCCI), 2014.

[7] D. Elenius and M. Blomberg, "Adaptation and normalization experiments in speech recognition for 4 to 8 year old children," in Proc. Interspeech, 2005.

[8] "OMG! Mobile voice survey reveals teens love to talk," 2015, http://googleblog.blogspot.com/2014/10/omg-mobile-voice-survey-reveals-teens.html.

[9] "Introducing the newest member of our family, the YouTube Kids app—available on Google Play and the App Store," 2015, http://youtube-global.blogspot.com/2015/02/youtube-kids.html.

[10] S. Ghai and R. Sinha, "Exploring the role of spectral smoothing in context of children's speech recognition," in Proc. ICASSP, 2009.

[11] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. ICASSP, 1996.

[12] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 1, pp. 49–60, 1998.

[13] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, "Speaker normalization on conversational telephone speech," in Proc. ICASSP, Atlanta, 1996, pp. 339–341.

[14] R. Serizel and D. Giuliani, "Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition," in Proc. IEEE Workshop on Spoken Language Technology (SLT), 2014.

[15] S. Umesh, R. Sinha, and D. R. Sanand, "Using vocal-tract length normalization in recognition of children speech," in Proc. National Conference on Communications, Kanpur, Jan. 2007.

[16] M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, Jan. 1998.

[17] T. N. Sainath, A. R. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, 2013, pp. 8614–8618.

[18] M. Russell, S. D'Arcy, and L. Qun, "The effects of bandwidth reduction on human and computer recognition of children's speech," IEEE Signal Processing Letters, vol. 14, no. 12, pp. 1044–1046, 2007.

[19] R. P. Lippmann, E. A. Martin, and D. B. Paul, "Multi-style training for robust isolated word speech recognition," in Proc. ICASSP, 1987.

[20] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP, 2015.

[21] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. Interspeech, 2014.

[22] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Proc. ICASSP, 2014.

[23] L. Bahl, P. de Souza, P. Gopalakrishnan, D. Nahamoo, and M. Picheny, "Context dependent modelling of phones in continuous speech using decision trees," in Proc. DARPA Speech and Natural Language Processing Workshop, 1991, pp. 264–270.

[24] S. Young, J. Odell, and P. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. ARPA Workshop on Human Language Technology, 1994, pp. 307–312.

[25] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. ICASSP, 2012.

[26] T. N. Sainath, B. Kingsbury, A. Mohamed, G. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, and B. Ramabhadran, "Improvements to deep convolutional neural networks for LVCSR," in Proc. ASRU, 2013.

[27] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. ICASSP, 2013.

[28] A. Robinson and F. Fallside, "The utility driven dynamic error propagation network," University of Cambridge, Tech. Rep. CUED/F-INFENG/TR.1, 1987.

[29] N. Coccaro, "Find results at the right reading level," Dec. 2010, http://googleforstudents.blogspot.com/2010/12/find-results-at-right-reading-level.html.

[30] K. Collins-Thompson and J. P. Callan, "A language modeling approach to predicting reading difficulty," in Proc. HLT-NAACL, 2004, pp. 193–200.

[31] C. Allauzen and M. Riley, "Bayesian language model interpolation for mobile speech input," in Proc. Interspeech, 2011.

[32] T. Brants, A. Popat, P. Xu, F. Och, and J. Dean, "Large language models in machine translation," in Proc. EMNLP, 2007.

[33] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in Proc. Advances in NIPS, 2012.

[34] G. Heigold, E. McDermott, V. Vanhoucke, A. Senior, and M. Bacchiani, "Asynchronous stochastic optimization for sequence training of deep neural networks," in Proc. ICASSP, 2014.

[35] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, Apr. 2002.