Concept-based analysis of scientific literature

This paper studies the importance of identifying and categorizing scientific concepts for achieving a deeper understanding of a scientific community's research literature. To this end, we propose an unsupervised bootstrapping algorithm for identifying and categorizing mentions of concepts. We then propose a new clustering algorithm that uses citation context to cluster the extracted mentions into coherent concepts. Evaluated against gold standards, the algorithms show significant improvement over state-of-the-art results. More importantly, we apply the proposed algorithms to the computational linguistics literature and show four different ways of summarizing and understanding the research community that are difficult to obtain with existing techniques.
