One Student Knows All Experts Know: From Sparse to Dense

The human education system trains one student with multiple experts. Mixture-of-Experts (MoE) is a powerful sparse architecture that likewise includes multiple experts. However, sparse MoE models are hard to implement, prone to overfitting, and not hardware-friendly. In this work, inspired by the human education model, we propose a novel task, knowledge integration, whose goal is to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE. We investigate this task with a general training framework consisting of knowledge gathering and knowledge distillation. Specifically, we first propose Singular Value Decomposition Knowledge Gathering (SVD-KG) to gather key knowledge from the different pretrained experts. We then refine the dense student model by knowledge distillation to offset the noise introduced by gathering. On ImageNet, our OneS preserves 61.7% of the MoE benefits and achieves 78.4% top-1 accuracy with only 15M parameters. On four natural language processing datasets, OneS obtains 88.2% of the MoE benefits and outperforms the state of the art by 51.7% using the same architecture and training data. In addition, compared with its MoE counterpart, OneS achieves a 3.7× inference speedup thanks to its hardware-friendly dense architecture.
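The framework therefore has two stages: gather weights from the pretrained experts via SVD, then refine the resulting dense student by distilling from the MoE teacher. Below is a minimal PyTorch sketch of how such a pipeline could look. It is not the paper's implementation: the merge-by-averaged-low-rank-reconstruction strategy, the `rank`, `T`, and `alpha` parameters, and both function names are illustrative assumptions.

```python
# Illustrative sketch only (assumed details, not the paper's code):
# (1) SVD-KG: keep the top-`rank` singular components of each expert's weight
#     matrix (shape d_out x d_in) and average the reconstructions into one
#     dense weight for the student.
# (2) Knowledge distillation: standard soft-target KD against the MoE teacher
#     to offset the noise introduced by the gathering step.
import torch
import torch.nn.functional as F


def svd_knowledge_gathering(expert_weights, rank):
    """Merge several expert weight matrices into one dense matrix by
    averaging their rank-`rank` SVD reconstructions (illustrative)."""
    merged = torch.zeros_like(expert_weights[0])
    for W in expert_weights:
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Low-rank reconstruction keeps each expert's dominant directions.
        merged += (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
    return merged / len(expert_weights)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style KD: soft targets from the MoE teacher plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

As a usage sketch, the dense student's layer would be initialized with `svd_knowledge_gathering([W_1, ..., W_E], rank)` and then trained with `distillation_loss`, with the frozen sparse MoE providing `teacher_logits`.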
