Deep learning with maximal figure-of-merit cost to advance multi-label speech attribute detection

In this work, we aim to improve speech attribute detection by formulating it as a multi-label classification task, with deep neural networks (DNNs) serving as the speech attribute detectors. A straightforward way to tackle the task is to estimate the DNN parameters with a mean squared error (MSE) loss function and to use a sigmoid function at the DNN output nodes. A more principled approach, however, is to incorporate the micro-F1 measure, a widely used metric in multi-label classification, into the DNN loss function so that the metric of interest is improved directly at training time. Micro-F1 is not differentiable, but we overcome this problem by casting the task in the maximal figure-of-merit (MFoM) learning framework. The results demonstrate that our MFoM approach consistently outperforms the baseline systems.
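
To make the idea of a differentiable stand-in for micro-F1 concrete, the following is a minimal sketch and not the MFoM formulation used in the paper. It assumes that hard 0/1 decisions are replaced by sigmoid-smoothed soft decisions, so that the true-positive, false-positive, and false-negative counts pooled over all attributes become smooth functions of the network scores. The function names and the smoothing parameters `alpha` and `beta` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here as a smooth surrogate for a 0/1 decision."""
    return 1.0 / (1.0 + np.exp(-z))


def soft_micro_f1_loss(scores, labels, alpha=1.0, beta=0.0, eps=1e-8):
    """Smooth approximation of (1 - micro-F1). Illustrative only.

    scores: array (n_samples, n_attributes) of pre-sigmoid network outputs.
    labels: array of the same shape with 0/1 attribute targets.
    alpha, beta: hypothetical slope and offset of the smoothing sigmoid.
    """
    labels = np.asarray(labels, dtype=float)
    # Soft decision that each attribute is present.
    p = sigmoid(alpha * np.asarray(scores, dtype=float) + beta)
    # Soft counts pooled over all samples and attributes (micro-averaging).
    tp = np.sum(p * labels)
    fp = np.sum(p * (1.0 - labels))
    fn = np.sum((1.0 - p) * labels)
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1


def hard_micro_f1(scores, labels):
    """Exact micro-F1 with a hard decision threshold at score 0."""
    labels = np.asarray(labels, dtype=float)
    pred = (np.asarray(scores) > 0).astype(float)
    tp = np.sum(pred * labels)
    fp = np.sum(pred * (1.0 - labels))
    fn = np.sum((1.0 - pred) * labels)
    denom = 2.0 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 0.0


# Toy data: 2 frames, 3 attributes (values are made up for illustration).
scores = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 1.0, -0.5]])
labels = np.array([[1, 0, 1],
                   [0, 1, 1]])

smooth_loss = soft_micro_f1_loss(scores, labels, alpha=1.0)
sharp_loss = soft_micro_f1_loss(scores, labels, alpha=50.0)
exact_f1 = hard_micro_f1(scores, labels)
```

With a steep sigmoid (large `alpha`) the soft loss approaches one minus the exact micro-F1, while a gentler slope gives a smoother surface that is better suited to gradient-based training. In practice the same expression would be written in an automatic-differentiation framework so that gradients flow back to the DNN parameters.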