Reservoir Transformers

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear “reservoir” layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
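To make the idea concrete, here is a minimal sketch, not the paper's exact configuration: a hypothetical PyTorch encoder stack in which every k-th layer is treated as a "reservoir", i.e. left at its random initialization with gradients disabled, while the remaining layers train normally. The class name, layer choice, and `reservoir_every` parameter are illustrative assumptions.

```python
import torch.nn as nn


class ReservoirTransformerEncoder(nn.Module):
    """Transformer encoder where every k-th layer is a frozen 'reservoir':
    randomly initialized and never updated during training (illustrative sketch)."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, reservoir_every=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        for i, layer in enumerate(self.layers):
            # Intersperse reservoir layers: keep their random weights fixed
            # and skip gradient computation for them.
            if (i + 1) % reservoir_every == 0:
                for p in layer.parameters():
                    p.requires_grad = False

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, src_mask=mask)
        return x


if __name__ == "__main__":
    model = ReservoirTransformerEncoder()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable}/{total}")
```

Because the frozen layers need no parameter updates (and, in principle, no stored gradients of their own weights), such a stack can reduce per-step compute and memory relative to a fully trained model of the same depth.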
