Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

This paper presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions. DefMiner achieves 85% F1 on a Wikipedia benchmark corpus, a significant improvement of 8% over the previous state of the art. We apply DefMiner to the ACL Anthology Reference Corpus (ARC), a large, real-world digital library of scientific articles in computational linguistics. The resulting automatically acquired glossary represents the terminology defined across several thousand individual research articles. We highlight several observations: more definitions are introduced in conference and workshop papers over the years, and multiword terms account for slightly less than half of all terms. From a list of the most frequently defined terms in this corpus of computational linguistics papers, we find that concepts can often be grouped into one of three categories: resources, methodologies and evaluation metrics.
