A lead‐lag analysis of the topic evolution patterns for preprints and publications

This study applied latent Dirichlet allocation (LDA) and regression analysis in a lead‐lag analysis to identify differences in topic evolution patterns between preprints from arXiv and papers from the Web of Science (WoS) in astrophysics over a 20-year period (1992–2011). Fifty topics in arXiv and WoS were generated using an LDA algorithm, and regression models were then used to explain four types of topic growth patterns. Based on the slopes of the fitted curves, the paper redefines topic trends and popularity. Results show that arXiv and WoS share similar topics in a given domain but differ in their evolution trends. Topics in WoS lose their popularity much earlier, and their durations of popularity are shorter, than those in arXiv. This work demonstrates that open access preprints show a stronger growth tendency than traditional printed publications.
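
The slope-based reading of topic trends can be illustrated with a minimal sketch. It is not the paper's actual model: the quadratic fit, the function names `fitted_slopes` and `first_decline_year`, and the synthetic yearly topic shares are all assumptions made for illustration. It fits a curve to one topic's yearly share in each corpus, takes the slope of the fitted curve, and compares the first year in which that slope turns negative.

```python
import numpy as np

def fitted_slopes(years, shares, degree=2):
    """Fit a polynomial to yearly topic shares; return the fitted slope at each year."""
    # Shift years to start at 0 so the least-squares fit stays well conditioned.
    x = np.asarray(years, dtype=float) - years[0]
    coeffs = np.polyfit(x, shares, degree)
    return np.polyval(np.polyder(coeffs), x)

def first_decline_year(years, shares, degree=2):
    """Return the first year whose fitted slope is negative, or None if there is none."""
    slopes = fitted_slopes(years, shares, degree)
    for year, slope in zip(years, slopes):
        if slope < 0:
            return year
    return None

years = list(range(1992, 2012))
t = np.arange(len(years), dtype=float)

# Synthetic data (assumption, not from the paper): one topic's share of documents
# per year, peaking later in the preprint corpus than in the journal corpus.
arxiv_share = 0.05 - 0.0002 * (t - 12.5) ** 2
wos_share = 0.05 - 0.0002 * (t - 8.5) ** 2

arxiv_turn = first_decline_year(years, arxiv_share)
wos_turn = first_decline_year(years, wos_share)
lag = arxiv_turn - wos_turn  # positive when the WoS topic starts declining first

print(arxiv_turn, wos_turn, lag)
```

In practice the yearly shares would come from the document-topic distributions of a fitted LDA model, and the regression form would be chosen per topic rather than fixed to a quadratic.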
