Machine Learning Applied to Aspirated and Non-Aspirated Allophone Classification–An Approach Based on Audio “Fingerprinting”

The purpose of this study is to apply both Convolutional Neural Networks (CNNs) and a conventional learning algorithm to the allophone classification task. A list of words containing aspirated and non-aspirated allophones, pronounced by native and non-native English speakers, is recorded, edited, and analyzed. Allophones extracted from the native English speakers' recordings are represented as two-dimensional spectrogram images and used as input to train the CNNs. Various settings of the spectral representation are analyzed to determine options adequate for allophone classification. Testing is then performed on the non-native speakers' utterances. The same approach is repeated with the conventional learning algorithm, but based on feature vectors rather than spectrogram images. The achieved classification results are promising, as high accuracy is observed.
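The first stage of the pipeline described above, converting an extracted allophone segment into a two-dimensional spectrogram image suitable as CNN input, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the FFT size and overlap are hypothetical placeholders for the "various settings of the spectral representation" the study compares.

```python
import numpy as np
from scipy.signal import spectrogram

def allophone_spectrogram(samples, sr=16000, n_fft=512, overlap=0.75):
    """Convert a mono audio segment to a normalized log-magnitude
    spectrogram image.

    n_fft and overlap are illustrative values only; the study evaluates
    several spectral-representation settings, which are not specified here.
    """
    noverlap = int(n_fft * overlap)
    f, t, sxx = spectrogram(samples, fs=sr, nperseg=n_fft, noverlap=noverlap)
    log_sxx = 10.0 * np.log10(sxx + 1e-10)          # magnitude in dB
    # min-max normalize to [0, 1] so the array can be treated as an image
    img = (log_sxx - log_sxx.min()) / (log_sxx.max() - log_sxx.min() + 1e-12)
    return img

# Usage: a 100 ms synthetic tone standing in for an extracted allophone.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(int(0.1 * sr)) / sr)
img = allophone_spectrogram(tone, sr=sr)   # shape: (freq bins, time frames)
```

The resulting fixed-range 2-D array can be resized to a common shape and stacked into a batch for CNN training; the feature-vector branch of the study would instead compute scalar descriptors from the same segments.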
