Predicting a Scientific Community’s Response to an Article

We consider the problem of predicting measurable responses to scientific articles based primarily on their text content. Specifically, we consider papers in two fields (economics and computational linguistics) and make predictions about downloads and within-community citations. Our approach is based on generalized linear models, allowing interpretability; a novel extension that captures first-order temporal effects is also presented. We demonstrate that text features significantly improve accuracy of predictions over metadata features like authors, topical categories, and publication venues.

[1]  P. McCullagh,et al.  Generalized Linear Models , 1992 .

[2]  P. McCullagh Regression Models for Ordinal Data , 1980 .

[3]  Eric P. Xing,et al.  Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream , 2010, UAI.

[4]  David D. Jensen,et al.  Exploiting relational structure to understand publication patterns in high-energy physics , 2003, SKDD.

[5]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[6]  Gwilym M. Jenkins,et al.  Time series analysis, forecasting and control , 1971 .

[7]  Andrew McCallum,et al.  Topics over time: a non-Markov continuous-time model of topical trends , 2006, KDD '06.

[8]  Daniel Jurafsky,et al.  Who should I cite: learning literature search models from citation behavior , 2010, CIKM.

[9]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[10]  Daniel Jurafsky,et al.  Studying the History of Ideas Using Topic Models , 2008, EMNLP.

[11]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[12]  P. Young,et al.  Time series analysis, forecasting and control , 1972, IEEE Transactions on Automatic Control.

[13]  Dragomir R. Radev,et al.  The ACL anthology network corpus , 2009, Language Resources and Evaluation.

[14]  Chong Wang,et al.  Continuous Time Dynamic Topic Models , 2008, UAI.

[15]  J. Lafferty,et al.  Mixed-membership models of scientific publications , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Jonathan D. Cryer,et al.  Time Series Analysis , 1986 .

[17]  Dragomir R. Radev,et al.  Scientific Paper Summarization Using Citation Summary Networks , 2008, COLING.

[18]  J. T. Wulu,et al.  Regression analysis of count data , 2002 .

[19]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[20]  Noah A. Smith,et al.  Predicting Risk from Financial Reports with Regression , 2009, NAACL.

[21]  Arthur E. Hoerl,et al.  Ridge Regression: Biased Estimation for Nonorthogonal Problems , 2000, Technometrics.

[22]  Noah A. Smith,et al.  Movie Reviews and Revenues: An Experiment in Text Regression , 2010, NAACL.

[23]  Chaomei Chen,et al.  Visualizing knowledge domains , 2005, Annu. Rev. Inf. Sci. Technol..

[24]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[25]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[26]  Jure Leskovec,et al.  The Download Estimation task on KDD Cup 2003 , 2003, SKDD.

[27]  Christopher D. Manning,et al.  Which universities lead and lag ? Toward university rankings based on scholarly output , 2010 .

[28]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[29]  Eric R. Ziegel,et al.  Generalized Linear Models , 2002, Technometrics.

[30]  R. Tibshirani,et al.  Sparsity and smoothness via the fused lasso , 2005 .

[31]  Sean Gerrish,et al.  A Language-based Approach to Measuring Scholarly Impact , 2010, ICML.