Task Loss Estimation for Sequence Prediction

Often, the performance on a supervised machine learning task is evaluated with a "task loss" function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a "surrogate loss" function, such as cross-entropy or hinge loss. For this remedy to be effective, it is important to ensure that minimization of the surrogate loss results in minimization of the task loss, a condition that we call "consistency with the task loss". In this work, we propose a method for deriving differentiable surrogate losses that provably meet this requirement. We focus on the broad class of models that define a score for every input-output pair. Our idea is that this score can be interpreted as an estimate of the task loss, and that the estimation error may be used as a consistent surrogate loss. A distinct feature of such an approach is that it defines the desirable value of the score for every input-output pair. We use this property to design specialized surrogate losses for Encoder-Decoder models often used for sequence prediction tasks. In our experiments, we benchmark on the task of speech recognition. Using a new surrogate loss instead of cross-entropy to train an Encoder-Decoder speech recognizer brings a significant ~13% relative improvement in Character Error Rate (CER) when no extra text corpora are used for language modeling.
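To make the core idea concrete, below is a minimal sketch, not the paper's exact formulation: the model's score for an input-output pair is read as an estimate of the task loss (here the edit distance), and training penalizes the squared estimation error over sampled candidate outputs. All names (score_fn, candidates, y_ref) are hypothetical placeholders introduced for illustration.

```python
import numpy as np

def task_loss(y_pred, y_ref):
    """Example task loss: edit (Levenshtein) distance between two sequences."""
    m, n = len(y_pred), len(y_ref)
    d = np.zeros((m + 1, n + 1), dtype=np.int32)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y_pred[i - 1] == y_ref[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,       # deletion
                          d[i, j - 1] + 1,       # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[m, n])

def surrogate_loss(score_fn, x, candidates, y_ref):
    """Squared estimation error between predicted scores and observed task losses.

    score_fn(x, y) -> float : model score, interpreted as a task-loss estimate
    candidates              : sampled output sequences for the input x
    y_ref                   : the ground-truth output sequence
    """
    errors = []
    for y in candidates:
        predicted = score_fn(x, y)       # model's estimate of the task loss
        observed = task_loss(y, y_ref)   # actual task loss of this candidate
        errors.append((predicted - observed) ** 2)
    return float(np.mean(errors))
```

Under this reading, decoding would select the candidate output with the lowest predicted score, i.e. the lowest estimated task loss; the paper's specialized losses for Encoder-Decoder models refine this basic estimation-error objective rather than replace it.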
