Human and computer recognition of regional accents and ethnic groups from British English speech

The paralinguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with the automatic extraction of this information from a short segment of speech. A state-of-the-art language identification (LID) system is applied to two problems: regional accent recognition for British English, and ethnic group recognition within a particular accent. We compare the results with human performance and, for accent recognition, with the 'text-dependent' ACCDIST accent recognition measure. For the 14 regional accents of British English in the ABI-1 corpus (good-quality read speech), our LID system achieves a recognition accuracy of 89.6%, compared with 95.18% for our best ACCDIST-based system and 58.24% for human listeners. The 'Voices across Birmingham' corpus contains significant amounts of conversational telephone speech for the two largest ethnic groups in the city of Birmingham (UK), namely the 'Asian' and 'White' communities. Our LID system distinguishes between these two groups with an accuracy of 96.51%, compared with 90.24% for human listeners. Although direct comparison is difficult, it seems that our LID system performs much better on the standard 12-class NIST 2003 Language Recognition Evaluation task and on the two-class ethnic group recognition task than on the 14-class regional accent recognition task. We conclude that automatic accent recognition is a challenging task for speech technology, and speculate that the use of natural conversational speech may be advantageous for these types of paralinguistic task.
