Towards unsupervised speech processing

The development of an automatic speech recognizer is typically a highly supervised process involving the specification of phonetic inventories, lexicons, and acoustic and language models, and requiring annotated training corpora of parallel speech and text data. Although some model parameters may be modified via adaptation, the overall structure of the speech recognizer usually remains relatively static. While this approach has been effective for problems where there is adequate human expertise and labelled corpora are available, it is challenged by less supervised or unsupervised scenarios. It also contrasts sharply with human speech processing, where learning is an inherent ability. In this paper, three alternative scenarios for speech recognition “training” are described, each requiring decreasing amounts of human expertise and annotated resources, and increasing amounts of unsupervised learning. A speech deciphering challenge is then suggested whereby speech recognizers must learn sub-word inventories and word pronunciations from unannotated speech, supplemented with only non-parallel text resources. It is argued that such a capability will help alleviate the language barrier that currently limits the scope of speech recognition around the world, and will empower speech recognizers to continually learn and evolve through use.
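To make the idea of learning a sub-word inventory without transcriptions concrete, the sketch below clusters acoustic feature frames into a small set of discrete pseudo-phone units. This is only an illustrative toy, not the paper's method: the synthetic "MFCC-like" frames, the farthest-point initialization, and the plain k-means clustering are all assumptions chosen for brevity; real systems operate on actual spectral features and use far richer models (e.g., nonparametric Bayesian acoustic unit discovery, as in reference [14]).

```python
import numpy as np

def init_far(frames, k):
    """Farthest-point initialization: pick k mutually distant frames as seeds."""
    centroids = [frames[0]]
    for _ in range(k - 1):
        # distance of every frame to its nearest chosen seed
        d = np.min([np.linalg.norm(frames - c, axis=1) for c in centroids], axis=0)
        centroids.append(frames[d.argmax()])
    return np.array(centroids)

def kmeans(frames, k, iters=20):
    """Plain Lloyd's k-means; returns a pseudo-phone label per frame."""
    centroids = init_far(frames, k)
    labels = np.zeros(len(frames), dtype=int)
    for _ in range(iters):
        # assign each frame to its nearest centroid
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-estimate each centroid from its assigned frames
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic stand-in for speech frames: three well-separated 13-dim clusters,
# mimicking three distinct acoustic "units" (purely illustrative data).
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 13))
                    for c in (0.0, 1.0, 2.0)])

labels, _ = kmeans(frames, k=3)
# `labels` is now a discrete pseudo-phone transcription of the frame sequence,
# obtained with no annotations -- the first step toward an unsupervised inventory.
```

In a full decipherment pipeline, such discovered unit sequences would then be aligned against non-parallel text to hypothesize word pronunciations; this sketch covers only the unit-discovery step.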