Spoken Language Recognition: From Fundamentals to Practice

Spoken language recognition refers to the automatic process through which we determine or verify the identity of the language spoken in a speech sample. We study a computational framework that allows such a decision to be made in a quantitative manner. Over recent decades, spoken language recognition has made tremendous progress, benefiting from technological breakthroughs in related areas such as signal processing, pattern recognition, cognitive science, and machine learning. In this paper, we provide an introductory tutorial on the fundamentals of the theory and on state-of-the-art solutions, from both phonological and computational perspectives. We also give a comprehensive review of current trends and future research directions, using the language recognition evaluation (LRE) formulated by the National Institute of Standards and Technology (NIST) as case studies.
