Jointly Learning to Locate and Classify Words Using Convolutional Networks

In this paper, we propose a novel approach to weakly-supervised word recognition. Most state-of-the-art automatic speech recognition systems rely on frame-level labels, obtained either through forced alignment or through a sequential loss. Recently, weakly-supervised models have been proposed in computer vision that can learn which parts of the input are relevant for classifying a given pattern [1]. Our system is composed of a convolutional neural network and a temporal score aggregation mechanism. For each sentence, it is trained with only a subset of the words present in that sentence (the most frequent ones) as supervision, without knowing their order or how many times they occur. We show that the proposed system is able to jointly classify and localize words. We also evaluate the system on a keyword spotting task, and show that it can yield performance similar to a strong supervised HMM/GMM baseline.
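
To make the training setup concrete, the following PyTorch code is a minimal sketch of one way such a system could be wired, not the authors' Torch7 implementation: a 1-D convolutional network emits a score for every word at every frame, temporal aggregation (max-pooling is assumed here, one possible choice) turns those into sentence-level scores, and a multi-label binary cross-entropy loss matches them against a bag-of-words target. The argmax frame localizes each word. All layer sizes, feature dimensions, and the loss are illustrative assumptions.

import torch
import torch.nn as nn

class WordLocalizer(nn.Module):
    """Hypothetical sketch: per-frame word scoring + temporal aggregation."""
    def __init__(self, n_features=40, n_words=1000):
        super().__init__()
        # Convolutional front-end over the acoustic feature sequence,
        # input shape (batch, n_features, n_frames).
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 256, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(256, n_words, kernel_size=1),  # per-frame word scores
        )

    def forward(self, x):
        frame_scores = self.conv(x)            # (batch, n_words, n_frames)
        # Temporal aggregation: the best-scoring frame for each word both
        # drives the sentence-level decision and localizes the word.
        sent_scores, positions = frame_scores.max(dim=2)
        return sent_scores, positions

model = WordLocalizer()
features = torch.randn(8, 40, 500)             # 8 sentences, 500 frames each
targets = torch.zeros(8, 1000)                 # bag-of-words supervision
targets[:, [3, 42]] = 1.0                      # words present; order/count unknown
sent_scores, positions = model(features)
loss = nn.functional.binary_cross_entropy_with_logits(sent_scores, targets)
loss.backward()

Because only sentence-level presence labels enter the loss, the network is never told where a word occurs; localization emerges as a by-product of the aggregation, which is the behavior the abstract describes.
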

[1]  Ronan Collobert et al. From image-level to pixel-level labeling with Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[2]  Chong Wang et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In ICML, 2016.

[3]  Jonathan G. Fiscus et al. Results of the 2006 Spoken Term Detection Evaluation. 2006.

[4]  Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[5]  Dimitri Palaz et al. End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks. arXiv preprint, 2013.

[6]  Atsushi Nakamura et al. Integrating Deep Neural Networks into Structural Classification Approach based on Weighted Finite-State Transducers. In INTERSPEECH, 2012.

[7]  Yoshua Bengio et al. Attention-Based Models for Speech Recognition. In NIPS, 2015.

[8]  Sanjeev Khudanpur et al. Librispeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

[9]  Clément Farabet et al. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop, 2011.

[10]  Yoshua Bengio et al. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.

[11]  Yann LeCun. Generalization and network design strategies. Technical report, 1989.

[12]  Eric Fosler-Lussier et al. Conditional Random Fields for Integrating Local Discriminative Classifiers. IEEE Transactions on Audio, Speech, and Language Processing, 2008.

[13]  Léon Bottou. Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes, 1991.

[14]  Jason Weston et al. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 2011.

[15]  Yoshua Bengio et al. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 1992.

[16]  Dimitri Palaz et al. Joint phoneme segmentation inference and classification using CRFs. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2014.

[17]  Alex Graves. Sequence Transduction with Recurrent Neural Networks. arXiv preprint, 2012.

[18]  Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

[19]  Dong Yu et al. Investigation of full-sequence training of deep belief networks for speech recognition. In INTERSPEECH, 2010.

[20]  Murat Saraclar et al. Lattice Indexing for Spoken Term Detection. IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[21]  Daniel Povey et al. The Kaldi Speech Recognition Toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.

[22]  Stephen P. Boyd et al. Convex Optimization. Cambridge University Press, 2004.

[23]  Alex Graves et al. Recurrent Models of Visual Attention. In NIPS, 2014.

[24]  Jürgen Schmidhuber et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.