Definition Extraction with Balanced Random Forests

We propose a novel machine learning approach to identifying definitions in Polish documents. The classification method was chosen and adapted to suit the problem domain and the available dataset, which is highly imbalanced and noisy. We evaluate a Random Forest-based classifier on the task of extracting definitional sentences from natural language text and compare its performance with previous work.
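To illustrate the balancing idea named in the title, the following is a minimal sketch of how a single tree's training sample might be drawn in a balanced random forest: a bootstrap sample of the minority class plus an equally sized sample, drawn with replacement, from the majority class. This is not the authors' implementation. The function name `balanced_bootstrap`, the toy labels, and the 5:95 class ratio are assumptions made for illustration only.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return the indices of one class-balanced bootstrap sample.

    Every class contributes as many items as the smallest class has,
    each drawn with replacement from that class's own items.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_min))
    return sample


# Hypothetical toy data: 1 = definitional sentence (rare), 0 = any other sentence.
labels = [1] * 5 + [0] * 95
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

In a full forest, one such sample would be drawn per tree, so each tree sees both classes in equal proportion while the ensemble as a whole still covers most of the majority class.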
