Large Margin Neural Language Model

We propose a large-margin criterion for training neural language models. Conventionally, neural language models are trained by minimizing perplexity (PPL) on grammatical sentences. We demonstrate, however, that PPL may not be the best metric to optimize for some tasks, and instead propose a large-margin formulation that enlarges the margin between “good” and “bad” sentences in a task-specific sense. The proposed method is trained end-to-end and applies broadly to tasks that involve re-scoring generated text. Compared with minimum-PPL training, it yields up to a 1.1-point reduction in word error rate (WER) for speech recognition and up to a 1.0-point BLEU improvement for machine translation.
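
To make the idea concrete, below is a minimal sketch of one way such a criterion could look: a pairwise hinge loss that requires the language model's log-probability of a "good" hypothesis to exceed that of a "bad" one by at least a fixed margin. This is an illustrative sketch, not the paper's exact formulation; the names `lm`, `good_ids`, `bad_ids`, and `margin` are all assumptions, and `lm` stands for any module mapping token ids to next-token logits.

```python
import torch
import torch.nn as nn

def sentence_log_prob(lm: nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """Score a sentence as the sum of next-token log-probabilities.

    token_ids: (batch, seq_len) integer tensor.
    Returns:   (batch,) log-probability of each sentence under `lm`.
    """
    logits = lm(token_ids[:, :-1])                      # (batch, seq_len-1, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = token_ids[:, 1:].unsqueeze(-1)            # shifted next tokens
    token_lp = log_probs.gather(-1, targets).squeeze(-1)
    return token_lp.sum(dim=-1)

def large_margin_loss(lm: nn.Module,
                      good_ids: torch.Tensor,
                      bad_ids: torch.Tensor,
                      margin: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing the "good" hypothesis to out-score the "bad" one
    by at least `margin` in log-probability (a sketch, not the paper's exact loss)."""
    s_good = sentence_log_prob(lm, good_ids)
    s_bad = sentence_log_prob(lm, bad_ids)
    return torch.clamp(margin - (s_good - s_bad), min=0).mean()

# Toy usage with a trivial "language model" (embedding + linear head),
# purely to show the loss is differentiable end-to-end.
vocab, dim = 100, 32
lm = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
good = torch.randint(0, vocab, (4, 10))
bad = torch.randint(0, vocab, (4, 10))
loss = large_margin_loss(lm, good, bad)
loss.backward()
```

Unlike minimum-PPL training, which only raises the likelihood of reference sentences, a pairwise loss of this kind directly shapes the score gap the re-scoring step actually uses, which is the intuition behind the reported WER and BLEU gains.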
