Unsupervised segmentation of words into morphemes - Challenge 2005, An Introduction and Evaluation Report

The objective of the challenge for the unsupervised segmentation of words into morphemes, or Morpho Challenge for short, was to design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. Ideally, these are basic vocabulary units suitable for different tasks, such as speech and text understanding, machine translation, information retrieval, and statistical language modeling. The segmentations were evaluated in two complementary ways:

Competition 1: The proposed morpheme segmentations were compared to a linguistic gold-standard morpheme segmentation.

Competition 2: Speech recognition experiments were performed in which statistical n-gram language models used the proposed word segments instead of entire words.

Data sets were provided for three languages: Finnish, English, and Turkish. Participants were encouraged to apply their algorithm to all three test languages.
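As a rough illustration of what a comparison against a gold standard can look like, the following Python sketch scores proposed segmentations by the positions of their morpheme boundaries. It assumes boundary-level precision, recall, and F-measure as the metric, which the text above does not specify; the function names and the toy English data are hypothetical and are not taken from the challenge materials.

```python
def boundaries(segmentation):
    """Return the set of character offsets at which a morpheme boundary falls."""
    positions = set()
    offset = 0
    # The end of the last morph is the end of the word, not a boundary.
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_scores(proposed, gold):
    """Compute boundary precision, recall, and F-measure over the gold words.

    Both arguments map a word to its list of morphs. This is an assumed
    scoring scheme for illustration, not the official evaluation script.
    """
    hits = n_proposed = n_gold = 0
    for word, gold_segmentation in gold.items():
        proposed_bounds = boundaries(proposed[word])
        gold_bounds = boundaries(gold_segmentation)
        hits += len(proposed_bounds & gold_bounds)
        n_proposed += len(proposed_bounds)
        n_gold += len(gold_bounds)
    precision = hits / n_proposed if n_proposed else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical): one under-segmented and one over-segmented word.
gold = {"unhelpful": ["un", "help", "ful"], "cats": ["cat", "s"]}
proposed = {"unhelpful": ["un", "helpful"], "cats": ["ca", "t", "s"]}

precision, recall, f_measure = boundary_scores(proposed, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

In this toy case two of the three proposed boundaries match the gold standard and two of the three gold boundaries are found, so precision and recall coincide. Scoring boundaries rather than whole words gives partial credit to a segmentation that is only partly correct.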
