Little by Little: Semi Supervised Stemming through Stem Set Minimization

In this paper we take an important step towards completely unsupervised stemming by giving a scheme for semi-supervised stemming. The input to the system is a list of word forms and a list of suffixes. The work is motivated by the need to create a root or stem identifier for a language that has electronic corpora and some elementary linguistic work in the form of, say, a suffix list. The scope of our work is suffix-based morphology (i.e., no prefix or infix morphology). We give two greedy algorithms for stemming. We have performed extensive experiments with four languages: English, Hindi, Malayalam and Marathi. Accuracy figures ranging from 80% to 88% are reported across these languages.
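As a rough illustration of what stem set minimization could look like, the sketch below treats the task as a set-cover problem: each candidate stem covers the word forms it can produce with a listed suffix, and a greedy loop repeatedly picks the stem that covers the most still-uncovered words. This is not either of the paper's two algorithms, whose details the abstract does not give; the function names, the tie-breaking rule, and the toy word list are all assumptions made for the example.

```python
def candidate_stems(word, suffixes):
    """Return the word itself plus every stem left after stripping a listed suffix."""
    stems = {word}  # empty suffix: the word is its own candidate stem
    for suf in suffixes:
        if suf and word.endswith(suf) and len(word) > len(suf):
            stems.add(word[: -len(suf)])
    return stems


def greedy_stem_set(words, suffixes):
    """Greedy set-cover heuristic (an assumed stand-in, not the paper's method).

    Maps each word form to a stem while trying to keep the stem set small.
    """
    # covers[stem] = set of word forms that this stem can account for
    covers = {}
    for w in words:
        for s in candidate_stems(w, suffixes):
            covers.setdefault(s, set()).add(w)

    uncovered = set(words)
    assignment = {}
    while uncovered:
        # Pick the stem covering the most uncovered words; ties go to the
        # longer stem, then alphabetical order (hypothetical tie-break).
        best = max(covers, key=lambda s: (len(covers[s] & uncovered), len(s), s))
        for w in covers[best] & uncovered:
            assignment[w] = best
        uncovered -= covers[best]
    return assignment


# Toy input, assumed for illustration only.
words = ["walk", "walks", "walked", "walking", "talk", "talks"]
suffixes = ["s", "ed", "ing"]
result = greedy_stem_set(words, suffixes)
```

On this toy input every form of "walk" is assigned the stem "walk" and every form of "talk" the stem "talk", so six word forms are explained by a stem set of size two.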
