Paul N. Bennett | Chenyan Xiong | Zhicheng Dou | Arnold Overwijk | Di He | Guolin Ke | Shuqi Lu | Waleed Malik | Tie-Yan Liu
[1] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[2] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[3] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[4] Bhaskar Mitra, et al. Overview of the TREC 2019 deep learning track, 2020, ArXiv.
[5] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[6] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).
[7] Jeff Johnson, et al. Billion-Scale Similarity Search with GPUs, 2017, IEEE Transactions on Big Data.
[8] Hao Tian, et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.
[9] Ye Li, et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, 2020, ArXiv.
[10] Xing Xie, et al. MIND: A Large-scale Dataset for News Recommendation, 2020, ACL.
[11] Xipeng Qiu, et al. TENER: Adapting Transformer Encoder for Named Entity Recognition, 2019, ArXiv.
[12] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[13] Jacob Eisenstein, et al. Sparse, Dense, and Attentional Representations for Text Retrieval, 2021, Transactions of the Association for Computational Linguistics.
[14] Hao Wu, et al. Mixed Precision Training, 2017, ICLR.
[15] Jimmy J. Lin, et al. Document Expansion by Query Prediction, 2019, ArXiv.
[16] Xiujun Li, et al. Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space, 2020, EMNLP.
[17] Samy Bengio, et al. Generating Sentences from a Continuous Space, 2015, CoNLL.
[18] Tie-Yan Liu, et al. Rethinking Positional Encoding in Language Pre-training, 2020, ICLR.
[19] Ming-Wei Chang, et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.
[20] Pascal Vincent, et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010, J. Mach. Learn. Res.
[21] Jianfeng Gao, et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2018.
[22] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[23] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[24] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[25] Omer Levy, et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.
[26] Ankit Singh Rawat, et al. Are Transformers universal approximators of sequence-to-sequence functions?, 2020, ICLR.
[27] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[28] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[29] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.
[30] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.
[31] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.
[32] Zhuyun Dai, et al. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval, 2019, ArXiv.
[33] Iryna Gurevych, et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019, EMNLP.
[34] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.