Optimization of RNN-Based Speech Activity Detection

Speech activity detection (SAD) is an essential component of automatic speech recognition systems and affects overall system performance. This paper investigates an optimization process for recurrent neural network (RNN) based SAD. The process jointly optimizes all system parameters: those used for feature extraction, the neural network weights, and the back-end parameters. Three cost functions are considered for SAD optimization: the frame error rate, the NIST detection cost function, and the word error rate of a downstream speech recognizer. Three types of RNNs are compared: a basic RNN, a long short-term memory (LSTM) network with peepholes, and the coordinated-gate LSTM (CG-LSTM) network introduced by Gelly and Gauvain. Several optimization methods are also investigated. Because it is well suited to nondifferentiable optimization problems, quantum-behaved particle swarm optimization is used to optimize feature extraction and posterior smoothing, as well as for the initial training of the neural networks. Experimental SAD results are reported on the NIST 2015 SAD evaluation data as well as on the REPERE and AMI meeting corpora. Speech recognition results are reported on the OpenKWS'13 test data. For all tasks and conditions, the proposed optimization method significantly improves SAD performance, and among all the tested SAD methods the CG-LSTM model gives the best results.
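As a rough illustration of the first two cost functions named above, the sketch below computes a frame error rate and a detection cost on per-frame binary labels (1 = speech, 0 = non-speech). It is a simplified sketch, not the paper's scoring code: the function names are hypothetical, the weights 0.75 (miss) and 0.25 (false alarm) are assumed here to follow the NIST OpenSAD 2015 convention, and the official scoring works on durations with collars rather than on raw frame counts.

```python
from typing import Sequence


def frame_error_rate(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Fraction of frames whose hypothesized label differs from the reference."""
    if len(ref) != len(hyp) or not ref:
        raise ValueError("ref and hyp must be non-empty and of equal length")
    errors = sum(1 for r, h in zip(ref, hyp) if r != h)
    return errors / len(ref)


def detection_cost(
    ref: Sequence[int],
    hyp: Sequence[int],
    w_miss: float = 0.75,  # assumed weight on the miss rate
    w_fa: float = 0.25,    # assumed weight on the false-alarm rate
) -> float:
    """Weighted sum of miss and false-alarm rates (frame-level simplification)."""
    if len(ref) != len(hyp) or not ref:
        raise ValueError("ref and hyp must be non-empty and of equal length")
    n_speech = sum(1 for r in ref if r == 1)
    n_nonspeech = len(ref) - n_speech
    misses = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    false_alarms = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    # Guard against references that contain only one class.
    p_miss = misses / n_speech if n_speech else 0.0
    p_fa = false_alarms / n_nonspeech if n_nonspeech else 0.0
    return w_miss * p_miss + w_fa * p_fa


# Toy example: 4 speech frames, 2 of them missed, no false alarms.
reference = [0, 0, 1, 1, 1, 1, 0, 0]
hypothesis = [0, 0, 1, 1, 0, 0, 0, 0]
fer = frame_error_rate(reference, hypothesis)
dcf = detection_cost(reference, hypothesis)
print(f"FER = {fer:.3f}, DCF = {dcf:.3f}")
```

Both quantities depend on hard decisions, so they are piecewise constant in the system parameters. This is consistent with the abstract's motivation for using a gradient-free optimizer for the parts of the system that cannot be trained by backpropagation.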
