Citation Intent Classification Using Word Embedding

Citation analysis is an active area of research. To date it has relied mainly on statistical approaches, which do not examine the context in which a citation appears. Deep neural network methods may reveal more by analyzing that context directly. Existing scholarly datasets are well suited to statistical approaches but lack citation context, intent, and section information, and they are too small for deep learning. Citation intent analysis requires datasets in which each citation context is labeled with a citation intent class; most available datasets either have no labeled context sentences or contain too few samples for results to generalize. In this study, we critically examine the available citation intent datasets and propose an automated technique for labeling citation contexts with citation intent. Using the proposed method, we annotated ten million citation contexts from the Citation Context Dataset (C2D) with citation intent. We applied Global Vectors (GloVe), InferSent, and Bidirectional Encoder Representations from Transformers (BERT) word embedding methods and compared their Precision, Recall, and F1 scores. BERT embeddings performed significantly better, achieving a Precision of 89%. The labeled dataset is freely available for research purposes and will support further study of citation context analysis. It can also serve as a benchmark dataset for identifying citation motivation and function from in-text citations.
