Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research

Large digital corpora of written language, such as those held by the CLARIN-D centers, offer excellent opportunities for linguistic research on authentic language data. However, the large number of hits that a corpus query can return often poses challenges in concrete research settings, particularly when the queried word forms or constructions are (semantically) ambiguous. The joint project ‘Corpus-based Linguistic Research and Analysis Using Data Mining’ (“Korpus-basierte linguistische Recherche und Analyse mit Hilfe von Data-Mining”, ‘KobRA’) therefore investigates the benefits and issues of using machine learning technologies to automate post-retrieval cleaning and disambiguation tasks. This article gives an overview of the questions, methodologies, and current results of the project, specifically in the area of corpus-based lexicography and historical semantics. In this area, topic models were used to partition KWIC (keyword-in-context) lists of search results, retrieved by querying various corpora for polysemous or homonymous words, according to the individual meanings of these words.
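
To make the idea of partitioning a KWIC list by word sense more concrete, the following is a minimal sketch, not the project's actual implementation. It assumes an LDA-style topic model fitted with collapsed Gibbs sampling, treats each KWIC line as a small bag of context words, and assigns each line to its dominant topic. The function name `lda_partition`, the hyperparameter values, and the toy English snippets for the ambiguous word "bank" are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def lda_partition(snippets, num_senses=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Assign each KWIC snippet (a list of context tokens) to its dominant topic.

    Hypothetical helper: a plain collapsed Gibbs sampler for an LDA-style model.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in snippets for w in doc})
    # z[d][i] is the topic currently assigned to token i of snippet d.
    z = [[rng.randrange(num_senses) for _ in doc] for doc in snippets]
    doc_topic = [Counter(zs) for zs in z]                # topic counts per snippet
    topic_word = [Counter() for _ in range(num_senses)]  # word counts per topic
    topic_total = [0] * num_senses                       # token counts per topic
    for doc, zs in zip(snippets, z):
        for w, k in zip(doc, zs):
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(snippets):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from all counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_senses)
                ]
                k = rng.choices(range(num_senses), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Each snippet is labeled with the topic that covers most of its tokens.
    return [
        max(range(num_senses), key=lambda t: doc_topic[d][t])
        for d in range(len(snippets))
    ]


# Toy context windows around the keyword "bank" (illustrative data only).
snippets = [
    "money loan account interest".split(),
    "river water shore fishing".split(),
    "account loan credit money".split(),
    "shore river boat water".split(),
]
labels = lda_partition(snippets, num_senses=2)
for label, snippet in zip(labels, snippets):
    print(label, " ".join(snippet))
```

The topic indices carry no meaning of their own; a lexicographer would still need to inspect each group and decide which sense, if any, it corresponds to. On real search results the context windows would typically be preprocessed first (for example, by removing stop words), and the number of topics is itself a modeling choice.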
