Acoustic template-matching for automatic emergency state detection: An ELM based algorithm

Extreme Learning Machine (ELM) is a popular paradigm for training feedforward neural networks owing to its short training time. This paper applies the technique to the automatic classification of speech utterances. Power Normalized Cepstral Coefficients (PNCC) are used as feature vectors, and an ELM performs the final classification. Both the baseline ELM algorithm and kernel ELM are evaluated. Because an ELM has a fixed number of input neurons, a length normalization step transforms the variable-length PNCC sequence into a vector of fixed length. Two length normalization techniques are considered: the first is based on Dynamic Time Warping (DTW) distances, the second on the vectorized outer product of the trajectory matrix. Experiments have been conducted on the TIDIGITS corpus, to assess performance on an isolated speech recognition task, and on ITAAL, to validate the system on an emergency detection task in realistic acoustic conditions. The ELM approach has been compared with template matching based on Dynamic Time Warping and with a Support Vector Machine based speech recognizer. The results demonstrate the effectiveness of the approach in terms of both recognition performance and execution time. In particular, the combination of PNCCs, DTW distances and kernel ELM was the best-performing algorithm on both counts.
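The DTW-based length normalization and the baseline ELM classifier described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each utterance is already available as a NumPy array of feature frames (PNCC extraction is omitted), that the fixed-length vector is formed from the DTW distances to a set of reference templates, and that the hidden layer uses a sigmoid activation. The names `dtw_distance`, `dtw_features` and `BasicELM` are hypothetical, and the kernel ELM variant is not shown.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance between two frame sequences (rows are frames)."""
    n, m = len(a), len(b)
    # Euclidean distance between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m]


def dtw_features(seq, templates):
    """Map a variable-length sequence to one DTW distance per template."""
    return np.array([dtw_distance(seq, t) for t in templates])


class BasicELM:
    """Single-hidden-layer ELM: random hidden weights, least-squares output."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation (an assumption of this sketch).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y, n_classes):
        # Hidden-layer parameters are drawn once and never trained.
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        targets = np.eye(n_classes)[y]  # one-hot class targets
        # Output weights from the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```

Under these assumptions, `np.stack([dtw_features(s, templates) for s in sequences])` yields an input matrix with one row per utterance and one column per template, which can be passed to `BasicELM(n_hidden=...).fit(X, y, n_classes)`. In practice the distances would likely need scaling before the sigmoid layer to avoid saturation.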
