Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach

We report an experiment on the automatic identification of bigram multiword expressions (MWEs) in parallel Latvian and Lithuanian corpora. Because lexical resources and tools for these languages (e.g., POS taggers, parsers) are scarce and of limited quality, the approach relies only on raw corpora, lexical association measures (LAMs), and supervised machine learning (ML). Combining LAMs with ML, which has proven effective for other languages, also gives good results for Lithuanian and Latvian: 92.4% precision and 52.2% recall for Latvian, and 95.1% precision and 77.8% recall for Lithuanian.
