Improving language-universal feature extraction with deep maxout and convolutional neural networks

When deployed in automatic speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore two strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the desirable property of generating sparse feature representations. Second, we apply the convolutional neural network (CNN) architecture to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques reduces word error rate compared with the existing DNN-based LUFEs. Combining the two methods brings additional improvement on the target language.
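As a point of reference, a maxout unit in its usual formulation outputs the maximum over a small group of linear pre-activations instead of passing a single pre-activation through a sigmoid. The following is a minimal sketch of that general idea, not the implementation used in this work; the function names, the group size, and the toy weights are all illustrative assumptions.

```python
def linear(inputs, weights, biases):
    """Compute one linear pre-activation per weight row: w . x + b."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def maxout(pre_activations, group_size):
    """Split pre-activations into consecutive groups and keep each group's max."""
    if group_size <= 0 or len(pre_activations) % group_size != 0:
        raise ValueError("length must be a positive multiple of group_size")
    return [
        max(pre_activations[i:i + group_size])
        for i in range(0, len(pre_activations), group_size)
    ]


def maxout_layer(inputs, weights, biases, group_size):
    """Hypothetical helper: a linear map followed by maxout pooling."""
    return maxout(linear(inputs, weights, biases), group_size)


# Toy example (assumed values): 4 linear pieces pooled into 2 maxout units.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_biases = [0.0, 0.0, 0.0, 0.0]
features = maxout_layer([1.0, 2.0], toy_weights, toy_biases, group_size=2)
```

With a group size of 2, four linear pieces yield two output units, so the layer's output dimension is the number of pieces divided by the group size.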
