Paper information - Language Independent System for Definition Extraction: First Results Using Learning Algorithms

In this paper we report on the performance of different learning algorithms and different sampling techniques applied to a definition extraction task, using data sets in different languages. We compare our results with those obtained using handcrafted rules to extract definitions. When definition extraction is handled with machine learning algorithms, two issues arise. On the one hand, in most cases the data set used to extract definitions is unbalanced, so this characteristic must be dealt with using specific techniques. On the other hand, it is possible to use the same methods to extract definitions from documents in different corpora, making the classifier language independent.
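As an illustration of what a sampling technique for an unbalanced data set can look like, the following is a minimal sketch of random undersampling in plain Python. The abstract does not say which sampling techniques were evaluated, so the choice of undersampling, the function name `random_undersample`, and the toy sentences and labels are all assumptions made for illustration only.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class is as small as the rarest class.

    Hypothetical helper for illustration; not the method used in the paper.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is reduced to the size of the smallest class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    return balanced


# Toy data (assumed): 2 definition sentences against 8 non-definition sentences.
sentences = [f"sentence {i}" for i in range(10)]
labels = ["definition"] * 2 + ["other"] * 8

balanced = random_undersample(sentences, labels)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

Undersampling discards majority-class sentences, which keeps training cheap but throws information away; oversampling the minority class is the usual alternative when definitions are very rare.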

António Branco | Rosa Del Gaudio
