Multilevel Text Alignment with Cross-Document Attention

Text alignment finds application in tasks such as citation recommendation and plagiarism detection. Existing alignment methods operate at a single, predefined level and cannot learn to align texts at multiple levels, for example at both the sentence and the document level. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component, enabling structural comparisons across different levels (document-to-document and sentence-to-document). Our component is weakly supervised from document pairs and can align at multiple levels. We evaluate it on predicting document-to-document and sentence-to-document relationships for the tasks of citation recommendation and plagiarism detection. The results show that our approach outperforms previously established hierarchical attention encoders, based on recurrent and transformer contextualization, that are unaware of structural correspondence between documents.
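As a rough illustration of what a cross-document attention step could look like, the sketch below attends from a single vector of one document (a sentence or document representation) over the sentence vectors of another document. It is only a minimal sketch under stated assumptions: the dot-product scoring, the softmax normalization, the function name `cross_document_attention`, and the toy vectors are all hypothetical and are not taken from the paper, which does not specify these details in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_document_attention(query, candidate_sentences):
    """Attend from one vector of document A over sentence vectors of document B.

    Assumption: plain dot-product scoring; the actual scoring function
    used by the authors may differ.
    Returns the attention weights and the weighted sum of sentence vectors.
    """
    scores = [dot(query, s) for s in candidate_sentences]
    weights = softmax(scores)
    dim = len(candidate_sentences[0])
    context = [
        sum(w * s[i] for w, s in zip(weights, candidate_sentences))
        for i in range(dim)
    ]
    return weights, context


# Toy example with made-up 2-dimensional sentence vectors.
doc_a_sentence = [1.0, 0.0]
doc_b_sentences = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights, context = cross_document_attention(doc_a_sentence, doc_b_sentences)

# Index of the sentence in document B that receives the most attention.
best = max(range(len(weights)), key=weights.__getitem__)
```

In this toy setup the attention weights act as a soft sentence-to-document alignment, and the same function applied to a document-level query vector would give a document-to-document comparison signal.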
