Unsupervised Morpheme Analysis Evaluation by IR experiments - Morpho Challenge 2007

This paper presents the evaluation and results of Competition 2 (information retrieval experiments) in the Morpho Challenge 2007. Competition 1 (a comparison to a linguistic gold standard) is described in a companion paper. In Morpho Challenge 2007 the goal was to find and evaluate unsupervised machine learning algorithms that provide morpheme analysis for words in different languages. Morpheme analysis can be important in several applications where a large vocabulary is needed. Especially in morphologically complex languages, such as Finnish, Turkish and Arabic, agglutination, inflection, and compounding easily produce millions of different word forms, which is clearly too many for building an effective vocabulary and training probabilistic models for the relations between words. The benefits of successful morpheme analysis can be seen, for example, in speech recognition, information retrieval, and machine translation. In Morpho Challenge 2007 the morpheme analyses submitted by the Challenge participants were evaluated by performing information retrieval experiments, where the words in the documents and queries were replaced by their proposed morpheme representations and the search was based on morphemes instead of words. The results indicate that morpheme analysis has a significant effect on IR performance in all tested languages (Finnish, English and German). The best unsupervised and language-independent morpheme analysis methods can also rival the best language-dependent word normalization methods. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and was organized in collaboration with CLEF.
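The evaluation idea described above, replacing words in documents and queries by their morpheme representations and then searching on morphemes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ANALYSES` table and its morpheme labels are hypothetical and not taken from any Challenge submission, and the overlap-count `score` is a toy stand-in for whatever retrieval model the experiments actually used.

```python
from collections import Counter

# Hypothetical morpheme analyses (illustrative only, not from any submission).
ANALYSES = {
    "houses": ["house", "+PL"],
    "house": ["house"],
    "housing": ["house", "+ING"],
    "boats": ["boat", "+PL"],
    "boat": ["boat"],
}


def to_morphemes(text):
    """Replace each word by its morpheme analysis; unknown words stay whole."""
    terms = []
    for word in text.lower().split():
        terms.extend(ANALYSES.get(word, [word]))
    return terms


def score(query, document):
    """Toy ranking function: number of morpheme terms shared by query and document."""
    q = Counter(to_morphemes(query))
    d = Counter(to_morphemes(document))
    return sum(min(q[t], d[t]) for t in q)


def rank(query, documents):
    """Return document indices sorted by decreasing score."""
    return sorted(range(len(documents)), key=lambda i: -score(query, documents[i]))


if __name__ == "__main__":
    docs = ["boats on the lake", "old houses"]
    print(rank("house", docs))
```

In this toy setting the query "house" matches the document "old houses" only because both are reduced to the shared morpheme "house"; plain word matching would find no overlap.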
