Language Independent System for Definition Extraction: First Results Using Learning Algorithms

In this paper we report on the performance of different learning algorithms and different sampling techniques applied to a definition extraction task, using data sets in different languages. We compare our results with those obtained by handcrafted rules for extracting definitions. When definition extraction is handled with machine learning algorithms, two issues arise. On the one hand, in most cases the data set used to extract definitions is unbalanced, so specific techniques are needed to deal with this imbalance. On the other hand, the same methods can be used to extract definitions from documents in different corpora, making the classifier language independent.
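As a rough illustration of what a sampling technique does to an unbalanced data set, the sketch below applies random undersampling, one common generic option. The abstract does not say which techniques were evaluated, so this is not necessarily the method used here; the function name `random_undersample`, the labels, and the toy sentences are all hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class is as small as the rarest class.

    Generic illustration only; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # The rarest class sets the size that every class is reduced to.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 2 definition sentences among 20 sentences.
sentences = [f"sentence {i}" for i in range(20)]
labels = ["definition" if i % 10 == 0 else "other" for i in range(20)]

balanced = random_undersample(sentences, labels)
counts = Counter(label for _, label in balanced)
print(counts["definition"], counts["other"])
```

Undersampling discards majority-class sentences, which keeps training cheap but throws away data; oversampling the minority class is the usual alternative when definitions are very rare.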
