Combining a rule-based approach and machine learning in a good-example extraction task for the purpose of lexicographic work on contemporary standard German

The work presented in this paper is part of a dictionary project at the Berlin-Brandenburg Academy of Sciences and Humanities. For a large number of headwords, example sentences for their lexicographic descriptions have to be retrieved from a corpus of contemporary German. Lexicographers are typically faced with a huge number of corpus citations. A tool that selects only good examples (those worth considering for inclusion in the dictionary) and dismisses the rest would therefore save time and effort. A rule-based good-example extractor proved to be a good starting point, but it still delivers too many unacceptable citations. We have therefore combined this tool with a machine learner trained on the decisions of an experienced lexicographer. The learner has been optimized to reject a large share of the example sentences. We present the machine learning results on a test data set for various combinations of linguistic features and quantify the gain in time and effort for the lexicographers. We also discuss the shortcomings of our approach and suggest some measures to counter them.
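The two-stage idea described above, a rule-based filter followed by a learner tuned to reject many candidates, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rules in `passes_rules` (token limits, capitalization, final punctuation), the `scorer` callback standing in for the trained learner, and the recall-based threshold choice in `pick_threshold`. The abstract does not specify the actual rules, features, or optimization criterion used in the project.

```python
import math
from typing import Callable, Iterable, List, Sequence


def passes_rules(sentence: str, min_tokens: int = 5, max_tokens: int = 25) -> bool:
    """Hypothetical rule-based pre-filter (not the project's actual rules)."""
    tokens = sentence.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    stripped = sentence.strip()
    # Require a capitalized start and sentence-final punctuation.
    return stripped[0].isupper() and stripped.endswith((".", "!", "?"))


def pick_threshold(scores: Sequence[float], labels: Sequence[bool],
                   min_recall: float = 0.9) -> float:
    """Return the highest score threshold that still keeps at least
    `min_recall` of the examples the lexicographer accepted.

    A higher threshold rejects more candidates, so this is one simple
    way to favor rejection while bounding the loss of good examples.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("need at least one accepted example")
    keep = max(1, math.ceil(min_recall * len(positives)))
    return positives[keep - 1]


def select(sentences: Iterable[str], scorer: Callable[[str], float],
           threshold: float) -> List[str]:
    """Keep sentences that pass the rules and score at or above the threshold."""
    return [s for s in sentences if passes_rules(s) and scorer(s) >= threshold]
```

In use, `scorer` would be the decision function of whatever classifier was trained on the lexicographer's accept/reject decisions, and `pick_threshold` would be run on held-out labelled data. Lowering `min_recall` trades lost good examples for a shorter list to review.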
