Deep context of citations using machine-learning models in scholarly full-text articles

Information retrieval systems for scholarly literature rely not only on text matching but also on semantic and context-based features. Readers increasingly want to know how important an article is, what purpose it serves, and how much it has influenced follow-up research. Numerous machine-learning and artificial-intelligence techniques have been developed to improve retrieval of the most influential scientific literature. In this paper, we compare and improve on four existing state-of-the-art techniques designed to identify influential citations. We consider 450 citations from the Association for Computational Linguistics corpus, each classified by experts as either important or unimportant, and extract 64 features based on the methodology of the four state-of-the-art techniques. We apply the Extra-Trees classifier to select the 29 best features, and then apply the Random Forest and Support Vector Machine classifiers to all selected techniques. Using the Random Forest classifier, our supervised model improves on the state-of-the-art method by 11.25%, reaching an area under the Precision-Recall curve of 89%. Finally, we present our deep-learning model, a Long Short-Term Memory network, which uses all 64 features to distinguish important from unimportant citations with 92.57% accuracy.
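The Precision-Recall area under the curve reported above can be illustrated with a small sketch. The abstract does not say how the area is estimated, so the step-wise average-precision estimator below is an assumption, as are the function name `average_precision` and the toy labels and scores, which are not data from the study. Tied scores are not given special treatment here.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for an important citation, 0 for an unimportant one.
    scores: classifier confidence that the citation is important.
    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank citations from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision measured at each point where recall increases.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Toy example: six citations, three of them labelled important.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
```

In the toy ranking, the important citations sit at ranks 1, 3 and 4, so the estimate is the mean of the precisions 1/1, 2/3 and 3/4. A classifier that ranked every important citation above every unimportant one would score 1.0.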
