Improving Successor Variety for Morphological Segmentation

Successor variety is a commonly used measure for segmentation in language processing. It is based on the simple idea that a large variety of letters (or phonemes) following an initial word (or utterance) segment indicates a possible boundary. The measure dates back to Harris (1955), and several methods based on it have been used in the literature, particularly for segmenting words into morphemes. However, few studies have analyzed the measure itself. Although the idea is simple and effective, current methods do not exploit the measure to its full extent because of a number of problems with the successor variety scores. This paper addresses these problems by introducing a normalization method and demonstrates, through segmentation experiments on two typologically different languages, the effectiveness of this improvement on the morphological segmentation task.
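To make the basic idea concrete, the following minimal sketch computes raw (unnormalized) successor variety over a toy lexicon. It illustrates only the classic measure described above, not the normalization method this paper proposes. The function names, the toy word list, and the choice to count an end-of-word marker `#` as a successor are all assumptions made for illustration.

```python
from collections import defaultdict


def successor_variety(lexicon):
    """Map each word-initial segment (prefix) seen in the lexicon to the
    number of distinct symbols that follow it.

    Assumption: the end of a word counts as one extra successor, '#'.
    """
    successors = defaultdict(set)
    for word in lexicon:
        for i in range(1, len(word) + 1):
            next_symbol = word[i] if i < len(word) else "#"
            successors[word[:i]].add(next_symbol)
    return {prefix: len(symbols) for prefix, symbols in successors.items()}


def sv_profile(word, sv):
    """Successor variety after each proper prefix of `word`."""
    return [sv.get(word[:i], 0) for i in range(1, len(word))]


# Hypothetical toy lexicon, chosen only to make the peak visible.
lexicon = ["read", "reads", "reading", "reader", "readers",
           "real", "red", "rest"]
sv = successor_variety(lexicon)

# Scores after r, re, rea, read, readi, readin
print(sv_profile("reading", sv))  # [1, 3, 2, 4, 1, 1]
```

In this toy example the score peaks after "read", which suggests a boundary in "read|ing". The smaller peak after "re" shows how raw scores tend to be inflated near the beginning of a word, where many words share a short prefix.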

[1]  Stefan Bordag,et al.  Unsupervised Knowledge-Free Morpheme Boundary Detection , 2005 .

[2]  Çağrı Çöltekin,et al.  A Freely Available Morphological Analyzer for Turkish , 2010, LREC.

[3]  Paul Cohen,et al.  Voting experts: An unsupervised algorithm for segmenting sequences , 2007 .

[4]  Mathias Creutz,et al.  Unsupervised models for morpheme segmentation and morphology learning , 2007, TSLP.

[5]  Stefan Bordag.  Unsupervised and Knowledge-free Morpheme Segmentation and Analysis , 2007, CLEF.

[6]  M. Goldsmith,et al.  Statistical Learning by 8-Month-Old Infants , 1996 .

[7]  Michael R. Brent,et al.  An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery , 1999, Machine Learning.

[8]  Paul R. Cohen,et al.  Voting experts: An unsupervised algorithm for segmenting sequences , 2007, Intell. Data Anal..

[9]  Mathias Creutz,et al.  Morpheme Segmentation Gold Standards for Finnish and English , 2004 .

[10]  Vera Demberg,et al.  A Language-Independent Unsupervised Model for Morphological Segmentation , 2007, ACL.

[11]  John Goldsmith,et al.  An algorithm for the unsupervised learning of morphology , 2006, Natural Language Engineering.

[12]  Mathias Creutz,et al.  Inducing the Morphological Lexicon of a Natural Language from Unannotated Text , 2005 .

[13]  Herv Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora , 1998 .

[14]  Stephen F. Weiss,et al.  Word segmentation by letter successor varieties , 1974, Inf. Storage Retr..

[15]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[16]  Benno Stein,et al.  Putting Successor Variety Stemming to Work , 2006, GfKl.

[17]  T. Griffiths,et al.  A Bayesian framework for word segmentation: Exploring the effects of context , 2009, Cognition.

[18]  R. H. Baayen,et al.  The CELEX Lexical Database (CD-ROM) , 1996 .

[19]  Zellig S. Harris,et al.  From Phoneme to Morpheme , 1955 .

[20]  Riyad Al-Shalabi,et al.  Experiments with the Successor Variety Algorithm Using the Cutoff and Entropy Methods , 2005 .

[21]  M. Brent.  Advances in the computational study of language acquisition , 1996, Cognition.

[22]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.