Paper information - Predicting corpus example quality via supervised machine learning

Predicting corpus example quality via supervised machine learning

In this paper we present a supervised-learning approach to extracting good dictionary examples from corpora. We train our predictor of quality on a dataset of corpus examples annotated with a four-level ordinal variable, ranging from a very bad to a very good example. Each example is formally described through 23 variables, and the dependence of example quality on these variables is modelled using a regression model. The evaluation of the ranked results for each of the collocations in the annotated dataset shows that we obtain a precision of ~80% on the 10 top-ranked examples and a precision of ~90% on the three top-ranked examples. Our approach is also highly language-independent, suffering almost no loss on the 10 top-ranked examples and a loss of ~4% on the three top-ranked examples once the language-dependent and knowledge-source-dependent features are removed.
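The evaluation described in the abstract (rank the examples of each collocation by predicted quality, then measure precision on the top-ranked ones) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name `precision_at_k`, the toy collocations and scores, and in particular the threshold that treats gold labels of 3 or 4 on the four-level scale as "good" are not taken from the paper.

```python
from collections import defaultdict


def precision_at_k(examples, k, good_threshold=3):
    """Mean precision over collocations on the k top-ranked examples.

    examples: iterable of (collocation, predicted_score, gold_label),
    where gold_label is the four-level ordinal annotation (1..4).
    good_threshold is an assumed cut-off: labels >= 3 count as good.
    """
    by_collocation = defaultdict(list)
    for collocation, score, gold in examples:
        by_collocation[collocation].append((score, gold))

    precisions = []
    for items in by_collocation.values():
        # Rank this collocation's examples by predicted quality, best first.
        top = sorted(items, key=lambda pair: pair[0], reverse=True)[:k]
        hits = sum(1 for _, gold in top if gold >= good_threshold)
        precisions.append(hits / len(top))

    # Average the per-collocation precisions.
    return sum(precisions) / len(precisions)


# Hypothetical toy data: (collocation, regressor output, gold annotation).
toy = [
    ("crna kava", 3.6, 4),
    ("crna kava", 2.9, 3),
    ("crna kava", 2.1, 1),
    ("crna kava", 1.2, 2),
    ("donijeti odluku", 3.1, 2),
    ("donijeti odluku", 2.8, 4),
    ("donijeti odluku", 1.5, 1),
]

print(precision_at_k(toy, 1))  # 0.5
print(precision_at_k(toy, 2))  # 0.75
```

In this sketch the predicted scores stand in for the output of whatever regression model is trained on the 23 variables; any regressor that yields a real-valued score per example could feed the same ranking step.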

Nikola Ljubešić | Mario Peronja
