The differences between latent topics in abstracts and citation contexts of citing papers

The citation context of a reference is commonly expected to provide more detailed and direct information about the nature of a citation than other parts of the citing paper. Yet few studies have specifically addressed the extent to which the information in different parts of a scientific publication differs. Do abstracts tend to use conceptually broader terms than the sentences that form a citation context in the body of a publication? In this article, we propose a method to analyze and compare latent topics from two sources: the abstracts of papers that cite a target reference and the sentences in which those papers cite it. In an experiment, we applied topic modeling techniques to full-text papers from eight biomedical journals. Topics derived from the two sources are compared in terms of their similarity and of broad-narrow relationships defined on the basis of information entropy. The results show that abstracts and citation contexts are characterized by distinct sets of topics with moderate overlap. Furthermore, the results confirm that topics from the abstracts of citing papers contain broader terms than topics from citation contexts formed by citing sentences. The method and the findings could be used to enhance and extend current methodologies for research evaluation and citation evaluation.
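
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: a topic is represented as a term-probability dictionary, a topic with higher Shannon entropy is treated as broader, and similarity is taken as one minus the Jensen-Shannon divergence. The function names and the toy distributions are hypothetical and are not results from the study.

```python
import math


def entropy(topic):
    """Shannon entropy in bits of a topic given as {term: probability}."""
    return -sum(p * math.log2(p) for p in topic.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so it lies in [0, 1])."""
    terms = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mixture(a):
        # KL divergence from a to the mixture m; m[t] > 0 wherever a[t] > 0.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def compare_topics(abstract_topic, context_topic):
    """Compare one abstract topic with one citation-context topic.

    Assumption: higher entropy over terms is read as a 'broader' topic.
    """
    h_abstract = entropy(abstract_topic)
    h_context = entropy(context_topic)
    return {
        "similarity": 1.0 - js_divergence(abstract_topic, context_topic),
        "abstract_entropy": h_abstract,
        "context_entropy": h_context,
        "abstract_is_broader": h_abstract > h_context,
    }


# Toy term distributions, invented for illustration only.
abstract_topic = {"cancer": 0.25, "gene": 0.25, "cell": 0.25, "protein": 0.25}
context_topic = {"gene": 0.5, "mutation": 0.5}

result = compare_topics(abstract_topic, context_topic)
print(result)
```

In this toy case the abstract topic spreads its probability evenly over four terms and the citation-context topic over two, so the abstract topic has the higher entropy and is labeled broader under the stated assumption. In practice the term distributions would come from a fitted topic model rather than being written by hand.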
