SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.

Predicting the number of citations of scholarly documents is an emerging task in scholarly document processing. Beyond the intrinsic merit of this information, citation counts also serve as an imperfect proxy for quality, with the advantage of being cheaply available for large volumes of scholarly documents. Previous work has addressed citation count prediction either with relatively small training datasets, or with larger datasets containing only short, incomplete input text. In this work we leverage the open-access ACL Anthology collection in combination with the Semantic Scholar bibliometric database to create a large corpus of scholarly documents with associated citation information, and we propose a new citation prediction model called SChuBERT. In our experiments we compare SChuBERT with several state-of-the-art citation prediction models and show that it outperforms previous methods by a large margin. We also show the merit of using more training data and longer input text for citation count prediction.
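The abstract implies a pipeline in which a long document is split into chunks, each chunk is encoded with BERT, and the chunk representations feed a citation-count regressor. The abstract itself gives no implementation details, so the following is only a minimal sketch under assumed specifics: the chunk size of 512, mean-pooling of chunk vectors, and a log-transformed target are illustrative choices, and `encode_chunk` is a hypothetical stand-in for a real BERT encoder.

```python
import math
from typing import Callable, List

def split_into_chunks(tokens: List[str], chunk_size: int = 512) -> List[List[str]]:
    """Split a full-text token sequence into fixed-size chunks.

    512 tokens matches BERT's usual input limit; the paper's actual
    chunking scheme may differ.
    """
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def document_vector(tokens: List[str],
                    encode_chunk: Callable[[List[str]], List[float]],
                    chunk_size: int = 512) -> List[float]:
    """Encode each chunk (in the paper, presumably with BERT) and
    mean-pool the chunk vectors into a single document representation."""
    chunks = split_into_chunks(tokens, chunk_size)
    vecs = [encode_chunk(c) for c in chunks]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def log_citation_target(citations: int) -> float:
    """Citation counts are heavy-tailed, so regressing on log(1 + c)
    is a common choice; whether SChuBERT does this is an assumption."""
    return math.log1p(citations)
```

Any chunk encoder with a fixed output dimension can be plugged in for `encode_chunk`; a regression head (e.g. a small feed-forward network) would then map the pooled document vector to the log-citation target.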
