Definition Extraction using Linguistic and Structural Features

This paper uses a combination of linguistic and structural information to extract Dutch definitions. The corpus is a collection of Dutch texts on computing and eLearning that contains 603 definitions. Extraction proceeds in two steps. First, a parser is applied to the complete corpus, using a grammar based on the patterns observed in the definitions. Second, machine learning is applied to improve the results obtained with the grammar. The experiments show that combining linguistic information (n-grams, type of article, type of noun) with structural information (layout, position) is a promising approach to the definition extraction task.
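
The two-step process can be illustrated with a minimal sketch. It is not the grammar or feature set used in the paper: the single "X is <article> Y" pattern, the function names, the feature names and the example sentences are all assumptions made for illustration. Step 1 selects candidate sentences with a toy pattern; step 2 is represented only by the feature dictionaries (article types, one bigram, sentence position) that a classifier could consume.

```python
# Toy sketch of a grammar-then-features pipeline (all names are hypothetical).
ARTICLES = {"de": "definite", "het": "definite", "een": "indefinite"}


def article_type(token):
    """Map a Dutch article to its type; anything else is 'none'."""
    return ARTICLES.get(token.lower(), "none")


def is_candidate(tokens):
    """Step 1 (toy grammar): accept sentences shaped like '<term> is <article> ...'."""
    lowered = [t.lower() for t in tokens]
    if "is" not in lowered:
        return False
    i = lowered.index("is")
    # 'is' must have a term before it and an article right after it.
    return 0 < i < len(tokens) - 1 and lowered[i + 1] in ARTICLES


def features(tokens, position):
    """Step 2 input: linguistic and structural features for one candidate."""
    lowered = [t.lower() for t in tokens]
    i = lowered.index("is")
    return {
        "first_article": article_type(tokens[0]),          # linguistic
        "second_article": article_type(tokens[i + 1]),     # linguistic
        "bigram_after_verb": " ".join(lowered[i + 1:i + 3]),  # n-gram
        "position": position,                              # structural
    }


def extract(paragraph):
    """Return (sentence, features) pairs for every candidate in a paragraph."""
    results = []
    for position, sentence in enumerate(paragraph):
        tokens = sentence.rstrip(".").split()
        if is_candidate(tokens):
            results.append((sentence, features(tokens, position)))
    return results
```

Given the sentences "Een browser is een programma waarmee je webpagina's bekijkt." and "Het probleem is de snelheid van het netwerk.", the toy pattern accepts both, although only the first is a definition. Over-generation of this kind is what a learning step trained on such features would be expected to filter out.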
