Machine Learning–based Analysis of English Lateral Allophones

Abstract: Automatic classification methods, such as artificial neural networks (ANNs), the k-nearest neighbor algorithm (kNN) and self-organizing maps (SOMs), are applied to allophone analysis based on recorded speech. A list of 650 words containing positionally and/or contextually conditioned allophones was created for that purpose. Each word was audio-video recorded by a group of 16 native and non-native speakers, from which the speech of seven native speakers and phonology experts was selected for analysis. For the present study, a sub-list of 103 words containing the English alveolar lateral phoneme /l/ was compiled. The list includes ‘dark’ (velarized) allophonic realizations, which occur before a consonant or at the end of a word before silence, and 52 ‘clear’ allophonic realizations, which occur before a vowel, as well as voicing variants. The recorded signals were segmented into allophones and parameterized with a set of descriptors originating from the MPEG-7 standard, supplemented by dedicated time-based parameters and the modified MFCC features proposed by the authors. The ANN, kNN and SOM classifiers were then employed to detect the two types of allophones automatically, and various feature sets were tested to obtain the best performance. In the final experiment, a selected feature set was used to automatically evaluate the pronunciation of dark /l/ by non-native speakers.
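As a rough illustration of the pipeline summarized above, the sketch below computes MFCC-based feature vectors for pre-segmented /l/ allophones and trains a kNN classifier to separate dark from clear realizations. It is a minimal sketch, not the authors' implementation: the library choices (librosa, scikit-learn), the file paths, the use of plain rather than modified MFCCs, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the dark/clear /l/ classification pipeline described in
# the abstract. Not the authors' implementation: the libraries (librosa,
# scikit-learn), file layout, plain MFCCs and hyperparameters are assumptions.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load one pre-segmented /l/ allophone and summarize its MFCC trajectory
    as a fixed-length vector (per-coefficient mean and standard deviation)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical corpus index: (wav_path, label) with 0 = clear, 1 = dark.
# In practice this would list every segmented /l/ token from the recordings.
corpus = [
    ("allophones/clear_l_0001.wav", 0),
    ("allophones/dark_l_0001.wav", 1),
    # ... one entry per segmented allophone
]

X = np.array([mfcc_features(path) for path, _ in corpus])
y = np.array([label for _, label in corpus])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("dark vs. clear accuracy:", clf.score(X_test, y_test))
```

The same feature matrix could equally be fed to an ANN or used to train a SOM; kNN is shown here only because it requires the least scaffolding.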
