Paper Information - Identifying Focus, Techniques and Domain of Scientific Papers

The dynamics of a research community can be studied by extracting information from its publications. We propose a system for extracting detailed information, such as the main contribution, the techniques used and the problems addressed, from scientific papers. Such information cannot be extracted using approaches that assume that the words in a document are independent of each other. We use dependency trees, which give rich information about the structure of a sentence, and extract relevant information from them by matching semantic patterns. We then study how the computational linguistics community and its sub-fields have changed over the years with respect to the focus, methods used and domain problems described in their papers. We identify sub-fields of the community from the topics obtained by applying Latent Dirichlet Allocation to the text of the papers. We also find "innovative" phrases in each category for each year.
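
To make the idea of matching semantic patterns over dependency trees concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Dep` type, the `PATTERNS` table, the trigger verbs and the hand-written dependency triples are all hypothetical, chosen only to illustrate how a (head word, relation) pattern can pick out a phrase that a bag-of-words model would not distinguish.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """One typed dependency edge: relation(head, dependent)."""
    rel: str
    head: str
    dependent: str


# Hypothetical trigger patterns (assumption, not taken from the paper):
# a (lemmatized head verb, relation) pair maps its dependent to a category.
PATTERNS = {
    "FOCUS": {("propose", "dobj"), ("present", "dobj")},
    "TECHNIQUE": {("use", "dobj"), ("apply", "dobj")},
    "DOMAIN": {("address", "dobj")},
}


def extract(deps):
    """Return the dependents whose (head, relation) matches a category pattern."""
    found = {label: [] for label in PATTERNS}
    for dep in deps:
        for label, triggers in PATTERNS.items():
            if (dep.head, dep.rel) in triggers:
                found[label].append(dep.dependent)
    return found


# Hand-written edges for "We use dependency trees" and "We propose a system".
# A real pipeline would obtain these from a dependency parser.
deps = [
    Dep("nsubj", "use", "We"),
    Dep("dobj", "use", "trees"),
    Dep("nn", "trees", "dependency"),
    Dep("nsubj", "propose", "We"),
    Dep("det", "system", "a"),
    Dep("dobj", "propose", "system"),
]

result = extract(deps)
print(result)
```

In this sketch the word "trees" is labelled a technique only because it is the direct object of "use"; the same word in another syntactic position would not match, which is the kind of structural distinction the abstract attributes to dependency trees.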

Christopher D. Manning | S. Gupta
