Training Language Models Using Target-Propagation

While Truncated Back-Propagation Through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT; we conclude with an analysis of this phenomenon and suggestions for future work.
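
The abstract does not spell out the paper's formulation, but the sketch below illustrates, under stated assumptions, the kind of structure TPROP-style training exploits: a penalty-based relaxation in which the hidden states become free "target" variables and the recurrence is enforced only softly. The network sizes, the toy data, and the penalty weight `lam` are hypothetical choices for illustration, not the paper's setup.

```python
# Illustrative sketch (an assumption, not the paper's exact method): a
# penalty-style TPROP objective for a vanilla RNN. Hidden states become free
# variables H ("targets"), and the recurrence h_t = f(h_{t-1}, x_t) is
# enforced only through a soft quadratic penalty. Each term couples at most
# two adjacent targets, so no gradient has to flow through the full unrolled
# sequence, in contrast to (truncated) BPTT.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 8, 3, 5                      # sequence length, input/hidden sizes
X = rng.normal(size=(T, d_in))              # toy input sequence
Y = rng.normal(size=(T, 1))                 # toy regression targets

W_x = 0.1 * rng.normal(size=(d_h, d_in))    # input-to-hidden weights
W_h = 0.1 * rng.normal(size=(d_h, d_h))     # hidden-to-hidden weights
W_o = 0.1 * rng.normal(size=(1, d_h))       # hidden-to-output weights
H = np.zeros((T, d_h))                      # free hidden-state targets
lam = 1.0                                   # penalty weight (assumed hyperparameter)

def step(h_prev, x):
    """One vanilla-RNN transition."""
    return np.tanh(W_h @ h_prev + W_x @ x)

def tprop_objective(H):
    """Per-step prediction loss plus the soft recurrence penalty."""
    total, h_prev = 0.0, np.zeros(d_h)
    for t in range(T):
        total += float(np.sum((W_o @ H[t] - Y[t]) ** 2))                 # fit the output
        total += lam * float(np.sum((H[t] - step(h_prev, X[t])) ** 2))   # stay near the recurrence
        h_prev = H[t]
    return total

# With H held fixed, each term touches only (H[t-1], X[t], H[t], Y[t]) and the
# shared weights, so its gradient contribution can be computed independently
# of the other time-steps -- the parallelism BPTT's sequential unrolling lacks.
print("initial TPROP-style objective:", tprop_objective(H))
```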
