论文信息 - The Evolved Transformer - 字舞流文

The Evolved Transformer

Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.

Quoc V. Le | Chen Liang | David R. So | Chen Liang

[1] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[2] Quoc V. Le,et al. Efficient Neural Architecture Search via Parameter Sharing , 2018, ICML.

[3] Ameet Talwalkar,et al. Non-stochastic Best Arm Identification and Hyperparameter Optimization , 2015, AISTATS.

[4] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[5] Ramesh Raskar,et al. Designing Neural Network Architectures using Reinforcement Learning , 2016, ICLR.

[6] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[7] Thorsten Brants,et al. One billion word benchmark for measuring progress in statistical language modeling , 2013, INTERSPEECH.

[8] Andrew L. Maas. Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[9] Theodore Lim,et al. SMASH: One-Shot Model Architecture Search through HyperNetworks , 2017, ICLR.

[10] Yann Dauphin,et al. Convolutional Sequence to Sequence Learning , 2017, ICML.

[11] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[12] Kenji Doya,et al. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning , 2017, Neural Networks.

[13] Richard Socher,et al. Weighted Transformer Network for Machine Translation , 2017, ArXiv.

[14] Alan L. Yuille,et al. Genetic CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15] Li Fei-Fei,et al. Progressive Neural Architecture Search , 2017, ECCV.

[16] Noam Shazeer,et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost , 2018, ICML.

[17] Yiming Yang,et al. DARTS: Differentiable Architecture Search , 2018, ICLR.

[18] Quoc V. Le,et al. Large-Scale Evolution of Image Classifiers , 2017, ICML.

[19] Yoshua Bengio,et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[20] Kalyanmoy Deb,et al. A Comparative Analysis of Selection Schemes Used in Genetic Algorithms , 1990, FOGA.

[21] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .

[22] Luke S. Zettlemoyer,et al. Deep Contextualized Word Representations , 2018, NAACL.

[23] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[24] Ankur Bapna,et al. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation , 2018, ACL.

[25] Aaron Klein,et al. Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search , 2018, ArXiv.

[26] Vijay Vasudevan,et al. Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27] Samy Bengio,et al. Tensor2Tensor for Neural Machine Translation , 2018, AMTA.

[28] Matt Post,et al. A Call for Clarity in Reporting BLEU Scores , 2018, WMT.

[29] Myle Ott,et al. Scaling Neural Machine Translation , 2018, WMT.

[30] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[31] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[32] Quoc V. Le,et al. Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[33] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[34] Ameet Talwalkar,et al. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization , 2016, J. Mach. Learn. Res..

[35] Ashish Vaswani,et al. Self-Attention with Relative Position Representations , 2018, NAACL.

[36] Quoc V. Le,et al. Semi-supervised Sequence Learning , 2015, NIPS.

[37] Yann Dauphin,et al. Pay Less Attention with Lightweight and Dynamic Convolutions , 2019, ICLR.

[38] Yann Dauphin,et al. Language Modeling with Gated Convolutional Networks , 2016, ICML.

[39] Liang Lin,et al. SNAS: Stochastic Neural Architecture Search , 2018, ICLR.

[40] Alok Aggarwal,et al. Regularized Evolution for Image Classifier Architecture Search , 2018, AAAI.

[41] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[42] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .