Reservoir Transformers

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
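As a rough illustration of the idea, the sketch below shows a stack of transformer encoder layers in which designated "reservoir" layers keep their random initialization and are excluded from gradient updates. This is a minimal sketch using PyTorch's standard nn.TransformerEncoderLayer; the layer sizes, the choice of which layers to freeze, and the class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ReservoirTransformerEncoder(nn.Module):
    """Encoder stack where some layers are frozen at random init ("reservoirs")."""
    def __init__(self, d_model=512, nhead=8, num_layers=6, reservoir_indices=(2, 4)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # Freeze the designated reservoir layers: their parameters stay at the
        # random initialization and never receive gradient updates.
        for i in reservoir_indices:
            for p in self.layers[i].parameters():
                p.requires_grad = False

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = ReservoirTransformerEncoder()
x = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
out = model(x)                # frozen layers still apply a fixed non-linear transformation
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Because the frozen layers never need parameter updates, they can reduce the per-step training cost while still providing non-linear mixing of the representations, which is the intuition behind the reported wall-clock improvements.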
