A Real-Time End-to-End Multilingual Speech Recognition Architecture

Automatic speech recognition (ASR) systems are used daily by millions of people worldwide to dictate messages, control devices, initiate searches, and facilitate data input on small devices. The user experience in these scenarios depends on the quality of the speech transcriptions and on the responsiveness of the system. For multilingual users, a further obstacle to natural interaction is the monolingual character of many ASR systems, which constrain users to a single preset language. In this work, we present an end-to-end multi-language ASR architecture, developed and deployed at Google, that allows users to select arbitrary combinations of spoken languages. We leverage recent advances in language identification and a novel method of real-time language selection to achieve recognition accuracy similar to, and latency characteristics nearly identical to, those of a monolingual system.
