The Evolved Transformer

Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by recent advances in feed-forward sequence models and then run evolutionary architecture search, warm-started by seeding our initial population with the Transformer. To search directly on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which dynamically allocates more training resources to the most promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B. At the "big" model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% fewer parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.
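
The sketch below illustrates the two ideas the abstract names: tournament-style evolution warm-started from the Transformer, and Progressive Dynamic Hurdles (PDH), which grants a candidate additional training steps only if its fitness clears a hurdle computed from models already evaluated at that stage. This is a minimal illustration, not the authors' implementation; `train_and_evaluate`, `mutate`, the genome encoding, and the step schedule are hypothetical placeholders.

```python
# Minimal sketch of evolutionary search with Progressive Dynamic Hurdles (PDH).
# All names below (train_and_evaluate, mutate, the genome tuples) are
# hypothetical stand-ins, not the paper's actual API.
import random
import statistics


def train_and_evaluate(genome, num_steps):
    """Hypothetical stand-in: train `genome` for `num_steps` and return a
    fitness score (e.g. negative validation log-perplexity)."""
    rng = random.Random(hash((genome, num_steps)))
    return rng.random() + 1e-5 * num_steps


def mutate(genome):
    """Hypothetical mutation: append a perturbation to an encoded architecture."""
    return genome + (random.randint(1, 5),)


def evolve_with_pdh(seed_genome, population_size=20, tournament_size=5,
                    generations=50, step_schedule=(1000, 3000, 9000)):
    # Warm start: seed the initial population with copies of the Transformer,
    # each evaluated only at the first (cheapest) training stage.
    population = [(seed_genome,
                   [train_and_evaluate(seed_genome, step_schedule[0])])
                  for _ in range(population_size)]

    def hurdle(stage):
        # Hurdle for a stage: mean fitness of models that completed that stage.
        scores = [f[stage] for _, f in population if len(f) > stage]
        return statistics.mean(scores) if scores else float("-inf")

    for _ in range(generations):
        # Tournament selection: the fittest of a random sample is the parent.
        parent, _ = max(random.sample(population, tournament_size),
                        key=lambda pf: pf[1][-1])
        child = mutate(parent)

        # Progressive Dynamic Hurdles: the child earns the next, larger
        # training budget only if its fitness clears the hurdle for the
        # stage it just completed, concentrating compute on promising models.
        fitnesses = []
        for stage, steps in enumerate(step_schedule):
            fitnesses.append(train_and_evaluate(child, steps))
            if fitnesses[-1] < hurdle(stage):
                break  # failed the hurdle: stop spending compute on this child

        # Replace the weakest member of the population with the child.
        worst = min(range(len(population)), key=lambda i: population[i][1][-1])
        population[worst] = (child, fitnesses)

    best, scores = max(population, key=lambda pf: pf[1][-1])
    return best, scores[-1]


if __name__ == "__main__":
    best_genome, best_fitness = evolve_with_pdh(seed_genome=("transformer",))
    print(best_genome, best_fitness)
```

In this toy version the hurdle for each stage is the mean fitness of population members that have reached it, so candidates that fall behind are cut off early while strong ones receive the full training budget, which mirrors the resource-allocation behavior described in the abstract.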
