Large Scale Data Enabled Evolution of Spoken Language Research and Applications

Abstract: Natural Language Processing (NLP) is an interdisciplinary field whose goal is to analyze and understand human languages. Natural languages take two forms, written and spoken, with text and speech as their respective media. The confluence of advances in signal processing, machine learning, cognitive computing, and big data has ushered in large-scale, data-driven approaches to speech research and applications. This chapter provides an introductory tutorial on the core tasks in speech processing, reviews recent large-scale, data-driven approaches to problems in spoken language, describes current trends in speech research, and indicates future research directions.
