Pre-Trained Models: Past, Present and Future
