Successful fish go with the flow: citation impact prediction based on centrality measures for term–document networks

We address the challenge of identifying, within a given set of texts, the documents that are most likely to have substantial impact in the future. To this end, we develop a purely content-based methodology that ranks a given set of documents, for example abstracts of scientific publications, by their potential to generate impact, measured as the number of citations the articles will receive in the future. We construct a bipartite network in which documents are linked to the keywords and terms they contain. We study recursive centrality measures for such networks that quantify how many different terms a document contains and how these terms are related to each other. From this we derive a novel indicator, document centrality, that is shown to be highly predictive of citation impact in six different case studies. We compare these results to findings from a multivariable regression model and from conventional network-based centrality measures to show that document centrality offers a comparably high performance in identifying those articles that contain a large number of high-impact keywords. Our findings suggest that articles that conform to the mainstream within a given research field tend to receive more citations than highly original and innovative articles.
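To make the idea of a recursive centrality on a term–document network concrete, the following is a minimal sketch and not the paper's definition of document centrality, which the abstract does not spell out. It assumes whitespace tokenization, an unweighted bipartite network, and an averaging recursion in which a document's score is the mean score of its terms and a term's score is the mean score of its documents. The function names `build_bipartite` and `recursive_centrality` and the toy abstracts are hypothetical.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Link each document to its distinct terms (assumed: whitespace tokens)."""
    doc_terms = {}
    term_docs = defaultdict(set)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        if not terms:
            continue  # skip empty documents to avoid division by zero later
        doc_terms[doc_id] = terms
        for term in terms:
            term_docs[term].add(doc_id)
    return doc_terms, dict(term_docs)


def recursive_centrality(doc_terms, term_docs, n_iter=1):
    """Illustrative averaging recursion on the bipartite network.

    Order 0: a document's score is its number of distinct terms,
    a term's score is the number of documents that contain it.
    Each further order replaces a node's score by the mean score
    of its neighbours from the previous order.
    """
    doc_score = {d: float(len(ts)) for d, ts in doc_terms.items()}
    term_score = {t: float(len(ds)) for t, ds in term_docs.items()}
    for _ in range(n_iter):
        new_doc = {
            d: sum(term_score[t] for t in ts) / len(ts)
            for d, ts in doc_terms.items()
        }
        new_term = {
            t: sum(doc_score[d] for d in ds) / len(ds)
            for t, ds in term_docs.items()
        }
        doc_score, term_score = new_doc, new_term
    return doc_score, term_score


# Hypothetical toy corpus: two documents share common terms, one does not.
toy_docs = {
    "a": "network centrality citation",
    "b": "network citation",
    "c": "quantum fish",
}
doc_terms, term_docs = build_bipartite(toy_docs)
doc_score, term_score = recursive_centrality(doc_terms, term_docs, n_iter=1)
ranking = sorted(doc_score, key=doc_score.get, reverse=True)
print(ranking)
```

In this toy setting, documents built from widely shared terms score above the document whose terms appear nowhere else, which matches the direction of the abstract's conclusion about mainstream articles. A real analysis would need stemming, stop-word handling, and the paper's actual indicator.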
