Automatic Speech Recognition and Identification of African Portuguese

This document deals with speech recognition of different Portuguese varieties and summarizes results from the author’s diploma thesis [9]. The performance of a hybrid large-vocabulary continuous speech recognizer, which combines multi-layer perceptrons and Hidden Markov Models, degrades heavily in the presence of African Portuguese varieties in broadcast news. Variety-specific acoustic and language models are shown to improve recognition significantly, by up to 21.1%, from a word error rate (WER) of 30.1% to 23.7%. Further, this document discusses a novel and efficient approach to automatically distinguishing African from European Portuguese, first presented in [8] [10]. The phonotactic variety identification system, based on phone recognition and language modeling, relies on a single tokenizer that encodes distinctive knowledge about the differences between the target varieties. This knowledge is introduced into a multi-layer perceptron phone recognizer by training variety-dependent phone models for the two varieties as contrasting classes. Compared to conventional single and fused phonotactic and acoustic systems, the approach lowers the computational cost and reduces the equal error rate (EER) by more than 60%, from 11.4% to 4.1%. The approach is extended to cover Brazilian Portuguese, where it also shows high variety identification performance.
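The phonotactic scoring step described above can be illustrated with a minimal sketch. It assumes the tokenizer has already produced a phone sequence in which some phones carry a variety tag. The labels `a_EP` and `a_AP`, the function names, the add-alpha smoothing, and the toy training data are all hypothetical choices made for illustration and are not taken from the thesis. One bigram language model is trained per variety, and an utterance is scored by the length-normalized log-likelihood ratio between the two models.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (illustrative)


def train_bigram_lm(sequences, vocab, alpha=1.0):
    """Train an add-alpha smoothed bigram model over phone tokens.

    Returns a function that maps a phone sequence to its log-probability.
    """
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    n_outcomes = len(vocab) + 1  # any phone or the end marker can follow

    def log_prob(seq):
        padded = [BOS] + list(seq) + [EOS]
        return sum(
            math.log((bigrams[(prev, cur)] + alpha)
                     / (contexts[prev] + alpha * n_outcomes))
            for prev, cur in zip(padded, padded[1:])
        )

    return log_prob


def variety_score(seq, lm_target, lm_other):
    """Log-likelihood ratio per bigram; positive values favour lm_target."""
    return (lm_target(seq) - lm_other(seq)) / (len(seq) + 1)


# Hypothetical tokenizer output: the vowel is variety-dependent,
# the consonants are shared between the varieties.
vocab = ["a_EP", "a_AP", "s", "t"]
ep_train = [["a_EP", "s", "t", "a_EP"], ["s", "a_EP", "t"]]
ap_train = [["a_AP", "s", "t", "a_AP"], ["s", "a_AP", "t"]]

lm_ep = train_bigram_lm(ep_train, vocab)
lm_ap = train_bigram_lm(ap_train, vocab)

score_ap_utt = variety_score(["s", "a_AP", "t", "a_AP"], lm_ap, lm_ep)
score_ep_utt = variety_score(["s", "a_EP", "t", "a_EP"], lm_ap, lm_ep)
print(score_ap_utt > 0, score_ep_utt < 0)
```

In such a setup, a decision threshold on the score would trade false acceptances against false rejections, and the EER is the operating point at which both error rates are equal.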

[1] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[2] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[3] J. R. Landis, et al. The measurement of observer agreement for categorical data, 1977, Biometrics.

[4] Stephen Cox, et al. Some statistical issues in the comparison of speech recognition algorithms, 1989, International Conference on Acoustics, Speech, and Signal Processing.

[5] Jonathan G. Fiscus, et al. Tools for the analysis of benchmark speech recognition tests, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[6] H. Hermansky, et al. Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.

[7] Hynek Hermansky, et al. RASTA-PLP speech analysis technique, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8] Hervé Bourlard, et al. Connectionist Speech Recognition: A Hybrid Approach, 1993.

[9] Etienne Barnard, et al. Analysis of phoneme-based features for language identification, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Jean-Luc Gauvain, et al. Developments in continuous speech dictation using the ARPA WSJ task, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[11] Ronald Rosenfeld, et al. Optimizing lexical and N-gram coverage via judicious use of linguistic data, 1995, EUROSPEECH.

[12] Marc A. Zissman, et al. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech, 2004.

[13] Isabel Trancoso, et al. Accent identification, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[14] Ciro Martins, et al. The development of a speaker independent continuous speech recognizer for Portuguese, 1997, EUROSPEECH.

[15] Steven Greenberg, et al. Robust speech recognition using the modulation spectrogram, 1998, Speech Commun.

[16] Alexander H. Waibel, et al. Unsupervised training of a speech recognizer: recent experiments, 1999, EUROSPEECH.

[17] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[18] João Paulo da Silva Neto, et al. Combination of acoustic models in continuous speech recognition hybrid systems, 2000, INTERSPEECH.

[19] Marc A. Zissman, et al. Automatic language identification, 2001, Speech Commun.

[20] Jean-Luc Gauvain, et al. Lightly supervised and unsupervised acoustic model training, 2002, Comput. Speech Lang.

[21] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[22] Fernando Pereira, et al. Weighted finite-state transducers in speech recognition, 2002, Comput. Speech Lang.

[23] I. Trancoso, et al. A Comparative Description of GtoP modules for Portuguese and Mirandese using Finite State Transducers, 2003.

[24] João Paulo da Silva Neto, et al. AUDIMUS.MEDIA: A Broadcast News Speech Recognition System for the European Portuguese Language, 2003, PROPOR.

[25] William M. Campbell, et al. High-level speaker verification with support vector machines, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[26] Patrick Kenny, et al. Experiments in speaker verification using factor analysis likelihood ratios, 2004, Odyssey.

[27] Jean-Luc Gauvain, et al. Language recognition using phone lattices, 2004, INTERSPEECH.

[28] Hermann Ney, et al. Unsupervised training of acoustic models for large vocabulary continuous speech recognition, 2005, IEEE Transactions on Speech and Audio Processing.

[29] William M. Campbell, et al. Support vector machines for speaker and language recognition, 2006, Comput. Speech Lang.

[30] Lukás Burget, et al. Brno University of Technology System for NIST 2005 Language Recognition Evaluation, 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[31] Jan Robert Stadermann. Automatische Spracherkennung mit hybriden akustischen Modellen, 2006.

[32] Douglas E. Sturim, et al. SVM Based Speaker Verification using a GMM Supervector Kernel and NAP Variability Compensation, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[33] Doroteo Torre Toledano, et al. Exploring PPRLM performance for NIST 2005 Language Recognition Evaluation, 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[34] Mark J. F. Gales, et al. Unsupervised Training for Mandarin Broadcast News and Conversation Transcription, 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[35] Pietro Laface, et al. Compensation of Nuisance Factors for Speaker and Language Recognition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[36] Douglas A. Reynolds, et al. Improving phonotactic language recognition with acoustic adaptation, 2007, INTERSPEECH.

[37] João Paulo da Silva Neto, et al. Incorporating acoustical modelling of phone transitions in an hybrid ANN/HMM speech recognizer, 2008, INTERSPEECH.

[38] Isabel Trancoso, et al. Language and variety verification on broadcast news for Portuguese, 2008, Speech Commun.

[39] Richard M. Schwartz, et al. Unsupervised versus supervised training of acoustic models, 2008, INTERSPEECH.

[40] Li-Rong Dai, et al. The Adaptation Schemes In PR-SVM Based Language Recognition, 2008, 2008 6th International Symposium on Chinese Spoken Language Processing.

[41] João Paulo da Silva Neto, et al. Evaluation of a live broadcast news subtitling system for Portuguese, 2008, INTERSPEECH.

[42] Douglas E. Sturim, et al. A comparison of subspace feature-domain methods for language recognition, 2008, INTERSPEECH.

[43] William M. Campbell. A covariance kernel for SVM language recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44] Isabel Trancoso, et al. Porting an European Portuguese broadcast news recognition system to Brazilian Portuguese, 2009, INTERSPEECH.

[45] Rong Tong, et al. Target-aware language models for spoken language recognition, 2009, INTERSPEECH.

[46] Bin Ma, et al. Prosodic attribute model for spoken language identification, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[47] Isabel Trancoso, et al. Exploiting variety-dependent phones in Portuguese variety identification applied to broadcast news transcription, 2010, INTERSPEECH.

[48] Pietro Laface, et al. Loquendo-Politecnico di Torino system for the 2009 NIST Language Recognition Evaluation, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[49] Ciro Martins, et al. Dynamic language modeling for European Portuguese, 2010, Comput. Speech Lang.