Structure Inducing Pre-Training

We present a theoretical analysis from first principles that establishes a novel connection between the relational inductive bias of pre-training and fine-tuning performance, while providing an extended view of general pre-training models. We further explore how existing pre-training methods impose relational inductive biases, finding that the vast majority of existing approaches focus almost exclusively on modelling relationships in an intra-sample manner (e.g., between tokens within a single sample) rather than a per-sample manner (e.g., between the embeddings of distinct samples). We build upon these findings with simulations and empirical studies on standard benchmarks spanning 3 data modalities and 10 downstream tasks. These investigations validate our theoretical analyses and provide a recipe for producing new pre-training methods which, in line with user-specified relational graphs, incorporate provably richer inductive biases than do existing methods.

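To make the per-sample notion concrete, the following is a minimal, hypothetical sketch (not the paper's actual method) of how a user-specified relational graph over pre-training samples could be turned into an auxiliary structure-inducing loss alongside a standard intra-sample objective. The function name `structure_inducing_loss`, the margin-based contrastive form, and the random-negative sampling are all illustrative assumptions; any encoder producing per-sample embeddings could plug in.

```python
# Hypothetical sketch: a per-sample, graph-guided auxiliary loss for pre-training.
# Assumptions (not from the paper): the relational graph is given as (i, j) edge
# pairs over pre-training samples, and a margin-based contrastive penalty is used.

import torch
import torch.nn.functional as F


def structure_inducing_loss(embeddings: torch.Tensor,
                            edges: torch.Tensor,
                            num_negatives: int = 5,
                            margin: float = 1.0) -> torch.Tensor:
    """Encourage samples linked in the user-specified relational graph to embed
    closer together than randomly drawn (presumed unlinked) samples.

    embeddings: (N, d) per-sample embeddings from the pre-training encoder.
    edges:      (E, 2) long tensor of graph edges (i, j) over the N samples.
    """
    anchors, positives = embeddings[edges[:, 0]], embeddings[edges[:, 1]]
    pos_dist = (anchors - positives).norm(dim=-1)  # pull linked pairs together

    # Sample random negatives per edge; with a sparse graph, most are non-edges.
    neg_idx = torch.randint(0, embeddings.size(0), (edges.size(0), num_negatives))
    neg_dist = (anchors.unsqueeze(1) - embeddings[neg_idx]).norm(dim=-1)

    # Hinge penalty: linked pairs should be closer than negatives by `margin`.
    return F.relu(margin + pos_dist.unsqueeze(1) - neg_dist).mean()


if __name__ == "__main__":
    # Illustrative usage: combine with any intra-sample objective (e.g., a
    # masked-token loss, represented here by a placeholder zero) via a weight.
    torch.manual_seed(0)
    sample_embeddings = torch.randn(32, 16, requires_grad=True)  # stand-in encoder output
    graph_edges = torch.randint(0, 32, (64, 2))                  # stand-in relational graph
    intra_sample_loss = torch.tensor(0.0)                        # e.g., masked-LM loss
    total = intra_sample_loss + 0.1 * structure_inducing_loss(sample_embeddings, graph_edges)
    total.backward()
    print(float(total))
```

Under this sketch, the relational graph acts as the carrier of the per-sample inductive bias: changing the graph changes which geometric constraints are imposed on the pre-trained latent space, while the intra-sample objective is left untouched.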