Speech Analytics Based on Machine Learning

In this chapter, the process of preparing speech data for machine learning is discussed in detail. Examples of speech analytics methods applied to phonemes and allophones are shown. Further, an approach to automatic phoneme recognition based on optimized parametrization and a machine learning classifier is discussed. Feature vectors are built from descriptors originating in the music information retrieval (MIR) domain. Phoneme classification is then extended beyond the typically used techniques towards Deep Neural Networks (DNNs). This is done by combining Convolutional Neural Networks (CNNs) with audio data converted to the time-frequency domain (i.e., spectrograms) and exported as images, so that a two-dimensional representation of the speech feature space is employed; a minimal sketch of such a conversion is given below. When preparing the phoneme dataset for the CNNs, zero-padding and interpolation techniques are used. The obtained results show an improvement in classification accuracy for allophones of the phoneme /l/ when CNNs coupled with the spectrogram representation are employed. Conversely, for vowel classification, the results are better for the approach based on pre-selected features and a conventional machine learning algorithm.
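
The sketch below illustrates, under stated assumptions, how variable-length phoneme segments can be turned into fixed-size spectrogram "images" for a CNN, using zero padding for short segments and interpolation to a common image size. It is not the authors' exact pipeline: the library choices (librosa, scipy) and all parameter values (sampling rate, STFT settings, target image shape) are illustrative assumptions.

```python
# Minimal sketch (assumed pipeline, not the chapter's exact implementation):
# phoneme segment -> zero padding -> STFT magnitude in dB -> interpolation to a
# fixed 2-D shape -> normalised array usable as a grayscale "image" for a CNN.
import numpy as np
import librosa
from scipy.ndimage import zoom

SR = 16000             # assumed sampling rate
MIN_SAMPLES = 4096     # assumed minimum segment length before the STFT
N_FFT, HOP = 512, 128  # assumed STFT settings
IMG_SHAPE = (64, 64)   # assumed target size (frequency bins x time frames)

def phoneme_to_spectrogram_image(samples: np.ndarray) -> np.ndarray:
    """Convert one phoneme segment to a fixed-size, dB-scaled spectrogram array."""
    # Zero padding: extend segments that are too short for a stable spectrogram.
    if len(samples) < MIN_SAMPLES:
        samples = np.pad(samples, (0, MIN_SAMPLES - len(samples)))

    # Time-frequency representation (magnitude spectrogram in dB).
    spec = np.abs(librosa.stft(samples, n_fft=N_FFT, hop_length=HOP))
    spec_db = librosa.amplitude_to_db(spec, ref=np.max)

    # Interpolation: rescale to a common 2-D shape so every example
    # fits the same CNN input layer.
    zoom_factors = (IMG_SHAPE[0] / spec_db.shape[0], IMG_SHAPE[1] / spec_db.shape[1])
    img = zoom(spec_db, zoom_factors, order=1)

    # Normalise to [0, 1], as if exporting a grayscale image.
    img = (img - img.min()) / (img.max() - img.min() + 1e-9)
    return img.astype(np.float32)

# Example usage with a hypothetical file holding one extracted /l/ allophone:
# y, _ = librosa.load("allophone_l_example.wav", sr=SR)
# x = phoneme_to_spectrogram_image(y)   # x.shape == IMG_SHAPE, ready for a CNN
```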
