Scientific papers citation analysis using textual features and SMOTE resampling techniques

Abstract Ascertaining the impact of research is important for the research community and academia in all disciplines. The only prevalent measure for quantifying research quality is the citation count. Although the number of citations plays a significant role in academic research, citations can be biased, or made only to discuss the weaknesses and shortcomings of the cited work. Considering the sentiment of citations and recognizing patterns in their text can aid in understanding the opinion of the peer research community and help quantify the quality of research articles. Efficient feature representation combined with machine learning classifiers has yielded significant improvements in text classification. However, the effectiveness of such combinations has not been analyzed for citation sentiment analysis. This study investigates pattern recognition using machine learning models in combination with frequency-based and prediction-based feature representation techniques, with and without the Synthetic Minority Oversampling Technique (SMOTE), on a publicly available citation sentiment dataset. The sentiment of each citation instance is classified as positive, negative, or neutral. Results indicate that the extra tree classifier in combination with Term Frequency-Inverse Document Frequency (TF-IDF) achieved 98.26% accuracy on the SMOTE-balanced dataset.
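To make the pipeline named in the abstract concrete, the following is a minimal sketch of its first two stages: TF-IDF features followed by SMOTE-style oversampling of the minority class. It is not the study's implementation, which the abstract does not specify. The function names `tfidf_vectors` and `smote`, the smoothed IDF variant, and the six toy citation sentences are all assumptions made for illustration. The classifier stage (for example, an extra-trees model from a machine learning library) is omitted.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF rows) for whitespace-tokenised strings."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF; this is one common variant, chosen here as an assumption.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        length = max(len(toks), 1)  # avoid division by zero for empty documents
        rows.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, rows


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical citation sentences, invented for this sketch only.
docs = [
    "the method is described in prior work",
    "we use the dataset from prior work",
    "results are reported in the cited paper",
    "the approach follows the cited paper",
    "the cited method fails on noisy data",
    "the cited results are weak and unconvincing",
]
labels = ["neutral"] * 4 + ["negative"] * 2

vocab, X = tfidf_vectors(docs)
minority = [x for x, y in zip(X, labels) if y == "negative"]
n_needed = labels.count("neutral") - len(minority)

# Oversample the minority class until both classes have the same size.
X_bal = X + smote(minority, n_needed, k=1)
y_bal = labels + ["negative"] * n_needed
print(Counter(y_bal))
```

Each synthetic vector lies on the line segment between two real minority vectors, which is what distinguishes SMOTE from plain duplication of minority samples. In practice, oversampling is applied to the training split only, so that synthetic points do not leak into the evaluation data.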
