Predicting clicks of PubMed articles

Predicting the popularity or access usage of an article has the potential to improve the quality of PubMed searches. By mining the PubMed query logs, which contain the access history of all articles, we can model the click trend of each article, that is, how its accesses change over time. In this article, we examine the access patterns produced by PubMed users over two years (July 2009 to July 2011). We explore the time series of accesses for each article in the query logs, model the trends with regression approaches, and subsequently use the models for prediction. We show that the click trends of PubMed articles are best fitted with a log-normal regression model. This model relates the number of accesses an article receives to the time since it first became available in PubMed via quadratic and logistic functions, with the model parameters estimated by maximum likelihood. Our experiments predicting the number of accesses for an article based on its past usage demonstrate that the mean absolute error and mean absolute percentage error of our model are 4.0% and 8.1% lower, respectively, than those of the power-law regression model. The log-normal regression model is also shown to perform significantly better than a previous prediction method based on a human memory theory in cognitive science. This work warrants further investigation into the utility of such a log-normal regression approach for improving information access in PubMed.
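To make the comparison concrete, the following is a minimal sketch of how a log-normal-shaped trend and a power-law trend might be fitted to one article's access counts and scored with mean absolute error (MAE) and mean absolute percentage error (MAPE). It is not the authors' implementation. The abstract does not give the exact functional form, so the sketch assumes that the logarithm of the click count is quadratic in the logarithm of the article's age (the shape of a log-normal curve) and that the power-law baseline is linear in the same log-log space. It also assumes Gaussian errors on the log scale, under which ordinary least squares coincides with the maximum likelihood estimate. The logistic component mentioned above is not modeled. All data are synthetic, and the names `fit_log_poly`, `predict`, `mae`, and `mape` are hypothetical.

```python
import numpy as np


def fit_log_poly(t, clicks, degree):
    """Least-squares fit of ln(clicks) as a polynomial in ln(t).

    degree=1 gives a power law; degree=2 gives a log-normal-shaped curve
    (assumed form, not taken from the paper).
    """
    return np.polyfit(np.log(t), np.log(clicks), degree)


def predict(coeffs, t):
    """Map the fitted log-space polynomial back to click counts."""
    return np.exp(np.polyval(coeffs, np.log(t)))


def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - predicted)))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))


# Synthetic monthly access counts for one article (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(1, 25, dtype=float)  # months since the article became available
trend = 200.0 * np.exp(-((np.log(t) - 1.0) ** 2) / (2 * 0.8 ** 2))
clicks = trend * np.exp(rng.normal(0.0, 0.05, size=t.size))

# Fit on the first 18 months, predict the remaining 6.
train = t <= 18
test = ~train
lognormal_coeffs = fit_log_poly(t[train], clicks[train], degree=2)
powerlaw_coeffs = fit_log_poly(t[train], clicks[train], degree=1)

lognormal_pred = predict(lognormal_coeffs, t[test])
powerlaw_pred = predict(powerlaw_coeffs, t[test])

lognormal_mae = mae(clicks[test], lognormal_pred)
powerlaw_mae = mae(clicks[test], powerlaw_pred)
lognormal_mape = mape(clicks[test], lognormal_pred)
powerlaw_mape = mape(clicks[test], powerlaw_pred)

print(f"log-normal: MAE={lognormal_mae:.2f}, MAPE={lognormal_mape:.1f}%")
print(f"power law:  MAE={powerlaw_mae:.2f}, MAPE={powerlaw_mape:.1f}%")
```

Because the synthetic data are generated from a log-normal-shaped curve, the quadratic fit is expected to win here by construction; the sketch illustrates only the mechanics of fitting and scoring, not the reported results.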
