Towards a stratified learning approach to predict future citation counts

In this paper, we study the problem of predicting the future citation count of a scientific article a given time interval after its publication. To this end, we gather a dataset of more than 1.5 million scientific papers from the computer science domain and analyze it exhaustively. The analysis shows that the citation counts of articles over the years follow a diverse set of patterns; on closer inspection, we identify six broad categories of citation patterns. This observation motivates us to adopt a stratified learning approach to the prediction task, for which we propose a two-stage prediction model: in the first stage, the model maps a query paper into one of the six categories, and in the second stage, a regression module is run only on the subpopulation corresponding to that category to predict the future citation count of the query paper. Experimental results show that categorizing this large dataset during the training phase leads to a remarkable improvement (around 50%) over the well-known baseline system.
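
The two-stage idea can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the nearest-centroid classifier, the one-feature least-squares regressor, the two category names (the paper identifies six categories, which are not named here), the feature layout (citations in the first and second year), and the toy numbers.

```python
from collections import defaultdict
from statistics import mean


def fit_centroids(X, labels):
    """Stage 1 training: one mean feature vector per category."""
    groups = defaultdict(list)
    for x, c in zip(X, labels):
        groups[c].append(x)
    return {c: [mean(col) for col in zip(*rows)] for c, rows in groups.items()}


def classify(centroids, x):
    """Stage 1 prediction: the category whose centroid is nearest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
    return min(centroids, key=sq_dist)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx


def fit_stratified(X, labels, y):
    """Train the classifier on all papers and one regressor per category."""
    centroids = fit_centroids(X, labels)
    by_cat = defaultdict(lambda: ([], []))
    for x, c, target in zip(X, labels, y):
        by_cat[c][0].append(x[0])   # regress on the first feature only
        by_cat[c][1].append(target)
    regressors = {c: fit_line(xs, ys) for c, (xs, ys) in by_cat.items()}
    return centroids, regressors


def predict(model, x):
    """Stage 1 picks a category; stage 2 applies that category's regressor."""
    centroids, regressors = model
    category = classify(centroids, x)
    slope, intercept = regressors[category]
    return category, slope * x[0] + intercept


# Hypothetical toy data: [citations in year 1, citations in year 2].
X = [[10, 2], [20, 4], [30, 6], [1, 10], [2, 20], [3, 30]]
labels = ["early_peak"] * 3 + ["late_peak"] * 3
y = [15, 30, 45, 50, 100, 150]  # made-up future citation counts

model = fit_stratified(X, labels, y)
print(predict(model, [12, 3]))    # ('early_peak', 18.0)
print(predict(model, [2.5, 25]))  # ('late_peak', 125.0)
```

In this toy data the two categories relate the first-year count to the future count very differently (slopes of 1.5 and 50), so a single regressor fitted on the pooled data would serve neither group well. This is the intuition behind running the regression only on the matching subpopulation.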
