An Annotated Dataset for Extracting Definitions and Hypernyms from the Web

This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms from web documents. As an additional resource, we also include a corpus of non-definitions whose syntactic patterns resemble those of definition sentences, e.g., "An android is a robot" vs. "Snowcap is unmistakable". Domain and style independence are achieved by annotating a large, domain-balanced corpus and by using a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure preserves the salient differences among distinct sequences while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mainly concerned with describing the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for completeness.
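To illustrate how a lattice can keep the differences among sequences while removing redundancy, the following minimal Python sketch merges a few generalized sentence patterns into a DAG. It is not the WCL algorithm described in the paper: the node key `(position, symbol)`, the function names `build_lattice` and `accepts`, the placeholder `<TARGET>`, and the toy patterns are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sentinel nodes marking the entry and exit of the lattice.
START, END = ("START",), ("END",)

def build_lattice(sequences):
    """Merge symbol sequences into a DAG whose nodes are (position, symbol).

    Identical symbols at the same position share one node, so repeated
    material is stored once while differing symbols remain as branches.
    Every edge goes from position i to i + 1, so the graph is acyclic.
    """
    edges = defaultdict(set)
    for seq in sequences:
        prev = START
        for i, sym in enumerate(seq):
            node = (i, sym)
            edges[prev].add(node)
            prev = node
        edges[prev].add(END)
    return edges

def accepts(edges, seq):
    """Return True if seq spells a START-to-END path in the lattice."""
    prev = START
    for i, sym in enumerate(seq):
        node = (i, sym)
        if node not in edges.get(prev, ()):
            return False
        prev = node
    return END in edges.get(prev, ())

# Toy patterns: the defined term is replaced by <TARGET>, nouns by the class NN.
patterns = [
    ["a", "<TARGET>", "is", "a", "NN"],
    ["an", "<TARGET>", "is", "a", "NN"],
    ["the", "<TARGET>", "is", "an", "NN"],
]
lattice = build_lattice(patterns)
node_count = len({n for src, dsts in lattice.items() for n in (src, *dsts)})
```

In this toy case the 15 input tokens collapse into 10 nodes (including the two sentinels). Because paths are shared, the lattice also accepts unseen combinations such as `["the", "<TARGET>", "is", "a", "NN"]`, which is the kind of generalization a position-merged DAG gives by construction; a real system would need to control it.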

[1] Marti A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora, 1992, COLING.

[2] Paola Velardi, et al. Mining the Web to Create Specialized Glossaries, 2008, IEEE Intelligent Systems.

[3] Qun Liu, et al. Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging, 2008, COLING.

[4] Gordon J. Pace, et al. Evolutionary Algorithms for Definition Extraction, 2009.

[5] Smaranda Muresan, et al. Generalizing Word Lattice Translation, 2008, ACL.

[6] Antonio Sanfilippo, et al. The Acquisition of Lexical Knowledge from Combined Machine-Readable Dictionary Sources, 1992, ANLP.

[7] Paola Velardi, et al. Learning Word-Class Lattices for Definition and Hypernym Extraction, 2010, ACL.

[8] Peng Jiang, et al. Automatic extraction of definitions, 2009, 2009 2nd IEEE International Conference on Computer Science and Information Technology.

[9] Angelika Storrer, et al. Automated detection and annotation of term definitions in German text corpora, 2006, LREC.

[10] E. N. Westerhout, et al. Definition Extraction using Linguistic and Structural Features, 2009.

[11] Gosse Bouma, et al. Learning to Identify Definitions using Syntactic Features, 2006, Learning Structured Information@EACL.

[12] Adam Przepiórkowski, et al. Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers, 2008, LREC.

[13] Roberto Navigli, et al. Word sense disambiguation: A survey, 2009, CSUR.

[14] Chris Dyer, et al. Using a maximum entropy model to build segmentation lattices for MT, 2009, NAACL.

[15] Adam Przepiórkowski, et al. Towards the Automatic Extraction of Definitions in Slavic, 2007, ACL 2007.

[16] William J. Byrne, et al. Statistical Phrase-Based Speech Translation, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[17] Xabier Arregi, et al. Extraction of semantic relations from a Basque monolingual dictionary using Constraint Grammar, 2000, ArXiv.

[18] William M. Campbell, et al. Language Recognition with Word Lattices and Support Vector Machines, 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[19] Philipp Koehn, et al. Word Lattices for Multi-Source Translation, 2009, EACL.

[20] Michael P. Oakes. Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus, 2005, RANLP Text Mining Workshop.

[21] Daniel Jurafsky, et al. Learning Syntactic Patterns for Automatic Hypernym Discovery, 2004, NIPS.

[22] Cristina Vertan. Natural Language Processing and Knowledge Representation for eLearning Environments, 2007.

[23] Grace Hui Yang, et al. A Metric-based Framework for Automatic Taxonomy Induction, 2009, ACL.

[24] Oren Etzioni, et al. What Is This, Anyway: Automatic Hypernym Discovery, 2009, AAAI Spring Symposium: Learning by Reading and Learning to Read.

[25] Lucy Vanderwende, et al. Automatically Deriving Structured Knowledge Bases From On-Line Dictionaries, 1993.

[26] Daniel Jurafsky, et al. Semantic Taxonomy Induction from Heterogenous Evidence, 2006, ACL.

[27] Silvia Bernardini, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English, 2008.

[28] Bob Carpenter, et al. Head-Driven Parsing for Word Lattices, 2004, ACL.

[29] Sharon A. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text, 1999, ACL.

[30] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[31] Eduard H. Hovy, et al. Extending Metadata Definitions by Automatically Extracting and Organizing Glossary Definitions, 2003, DG.O.

[32] António Branco, et al. Automatic Extraction of Definitions in Portuguese: A Rule-Based Approach, 2007, EPIA Workshops.

[33] Tat-Seng Chua, et al. Soft pattern matching models for definitional question answering, 2007, TOIS.

[34] Eline Westerhout, et al. Extraction of Dutch definitory contexts for eLearning purposes, 2007.