Aspect-based Document Similarity for Research Papers

Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in which aspects two documents are similar. This limits the granularity of applications, such as recommender systems, that rely on document similarity. In this paper, we extend similarity with aspect information by framing it as a pairwise document classification task. We evaluate our aspect-based document similarity approach on research papers. Paper citations indicate the aspect-based similarity, i.e., the title of the section in which a citation occurs acts as a label for the pair of citing and cited papers. We apply a series of Transformer models, such as RoBERTa, ELECTRA, XLNet, and BERT variations, and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show that SciBERT is the best-performing system. A qualitative examination validates our quantitative results. Our findings motivate future research on aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
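
The labeling idea described above, where the section title containing a citation becomes the label for the (citing, cited) pair, can be illustrated with a minimal Python sketch. This is not the released code: the record layout, the paper identifiers, the `normalize_section` helper, and the choice to collect several labels per pair are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical citation records: (citing_id, cited_id, section_title).
# The identifiers and section titles are made up for this example.
citations = [
    ("P1", "P2", "1 Introduction"),
    ("P1", "P2", "2.1 Related Work"),
    ("P1", "P3", "Experiments"),
]

def normalize_section(title):
    """Strip leading section numbering and lowercase the title.

    This normalization is an assumption of the sketch, not a step
    confirmed by the abstract.
    """
    return title.strip().lstrip("0123456789. ").lower()

def build_pairs(records):
    """Map each (citing, cited) pair to the sorted section labels it occurs in."""
    labels = defaultdict(set)
    for citing, cited, section in records:
        labels[(citing, cited)].add(normalize_section(section))
    return {pair: sorted(tags) for pair, tags in labels.items()}

pairs = build_pairs(citations)
for pair, tags in pairs.items():
    print(pair, tags)
```

Each resulting entry pairs two documents with one or more aspect labels, which is the kind of input a pairwise document classifier would be trained on.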
