Empirical Analysis of Score Fusion Application to Combined Neural Networks for Open Vocabulary Spoken Term Detection

System combination, which combines the outputs or internal representations of multiple systems, is a powerful method for improving performance on machine learning tasks and has been widely adopted in recent work on knowledge transfer. In this study, to clarify how effective knowledge can be extracted from an ensemble of neural networks, we first examine several score fusion methods for an ensemble of neural networks applied to open vocabulary spoken term detection, where the class probability output by each network is used as a similarity metric; we then investigate the trade-off between confusion and dark knowledge. In an experimental evaluation on open vocabulary spoken term detection, we obtain a 2.09% absolute gain over the best single-system result. Furthermore, the performance gains achieved via score fusion of class probabilities exactly match the mathematical inequality for sums and power means, and the gain achieved via summation of class probabilities is consistently better than that achieved via power-mean score fusion. The experimental analysis confirms that summation, which enhances the discriminative capability of the superior class probability, produces a smoothed probability distribution that yields more effective dark knowledge while adequately suppressing undesirable effects.
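The inequality for sums and power means referred to in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion formulas used in the study, so the functions `power_mean` and `sum_fusion`, the chosen exponents, and the three class probabilities below are all hypothetical and serve only to show the standard ordering of power means (harmonic ≤ geometric ≤ arithmetic ≤ quadratic) and how summation relates to the arithmetic mean.

```python
import math


def power_mean(probs, p):
    """Power (generalized) mean of positive values with exponent p.

    p = -1 is the harmonic mean, p = 0 (taken as a limit) the geometric
    mean, p = 1 the arithmetic mean, and p = 2 the quadratic mean.
    """
    n = len(probs)
    if p == 0:
        # The limit p -> 0 of the power mean is the geometric mean.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def sum_fusion(probs):
    """Plain summation of class probabilities (n times the arithmetic mean)."""
    return sum(probs)


# Hypothetical class probabilities that three networks assign to the
# same class for the same input (illustrative values, not paper data).
scores = [0.62, 0.48, 0.71]

# Assumed set of exponents; the study may use different ones.
exponents = [-1, 0, 1, 2]
fused = [power_mean(scores, p) for p in exponents]

for p, value in zip(exponents, fused):
    print(f"power mean (p={p:>2}): {value:.4f}")
print(f"sum fusion        : {sum_fusion(scores):.4f}")
```

For any set of positive scores the power mean is non-decreasing in `p`, so the printed values increase from the harmonic to the quadratic mean. Summation differs from the arithmetic mean only by the constant factor `len(scores)`, so for a fixed ensemble size the two rank candidates identically.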
