Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature

The most popular method for judging the impact of biomedical articles is citation count, the number of citations an article receives. Its most significant limitation is that it cannot be used to evaluate articles at the time of publication, because citations accumulate over time. This work presents computer models that accurately predict citation counts of biomedical publications over a long horizon of 10 years using only information available at publication time. Our experiments show that it is feasible to accurately predict future citation counts from a mixture of content-based and bibliometric features using machine learning methods. The models pave the way for practical prediction of the long-term impact of publications, and their statistical analysis provides greater insight into citation behavior.
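As a rough illustration of what "a mixture of content-based and bibliometric features" could look like in practice, the following minimal Python sketch merges term counts from an article's text with a few numeric bibliometric values into one feature dictionary. Everything specific here is an assumption made for illustration and not taken from the work itself: the field names (`journal_impact_factor`, `author_prior_citations`, `n_authors`), the choice of plain term counts, the framing of the target as a binary label, the threshold of 20 citations, and the example article, which is invented.

```python
import re
from collections import Counter


def content_features(text):
    """Content-based features: lowercase term counts from the article text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {f"term={t}": float(c) for t, c in Counter(tokens).items()}


def bibliometric_features(article):
    """Bibliometric features known at publication time (hypothetical fields)."""
    return {
        "bib=journal_impact_factor": float(article["journal_impact_factor"]),
        "bib=author_prior_citations": float(article["author_prior_citations"]),
        "bib=n_authors": float(article["n_authors"]),
    }


def build_features(article):
    """Merge both feature families into a single sparse feature dictionary."""
    feats = content_features(article["title"] + " " + article["abstract"])
    feats.update(bibliometric_features(article))
    return feats


def label(citations_after_10_years, threshold=20):
    """Assumed target: 1 if the 10-year citation count reaches the threshold."""
    return 1 if citations_after_10_years >= threshold else 0


# Invented example article, for illustration only.
example = {
    "title": "Statin therapy and stroke",
    "abstract": "Statin therapy reduced stroke risk.",
    "journal_impact_factor": 12.5,
    "author_prior_citations": 340,
    "n_authors": 4,
}
features = build_features(example)
```

A feature dictionary of this shape could then be passed to any standard supervised learner. The prefixes `term=` and `bib=` keep the two feature families apart, so their contributions can be inspected separately afterwards.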
