Citation Classification for Behavioral Analysis of a Scientific Field

Citations are an important indicator of the state of a scientific field: they reflect how authors frame their work and influence its uptake by future scholars. However, our understanding of citation behavior has been limited to small-scale manual citation analysis. We perform the largest behavioral study of citations to date, analyzing how citations are both framed and taken up by scholars across an entire field: natural language processing. We introduce a new dataset of nearly 2,000 citations annotated for function and centrality, and use it to develop a state-of-the-art classifier and label the entire ACL Reference Corpus. We then study how authors frame citations, and use both papers and online traces to track how readers follow them. We demonstrate that authors are sensitive to discourse structure and publication venue when citing, that online readers follow temporal links to previous and future work rather than methodological links, and that how a paper cites related work is predictive of its citation count. Finally, we use changes in citation roles to show that the field of NLP is undergoing a significant increase in consensus.
