Low-Resource Active Learning of Morphological Segmentation

Many Uralic languages have a rich morphological structure but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation from a large unannotated corpus and a small number of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.

Northern European Journal of Language Technology, 2016, Vol. 4, Article 4, pp. 47–72. DOI: 10.3384/nejlt.2000-1533.1644
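
To make the idea of a coverage-based query strategy on word-initial and word-final substrings more concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a greedy selection in which each candidate word is scored by the corpus frequency of its not-yet-covered initial and final substrings. The function names `edge_substrings` and `select_by_coverage`, the `max_len` parameter, and the scoring rule are all hypothetical choices made for illustration.

```python
from collections import Counter


def edge_substrings(word, max_len=4):
    """Return the word-initial and word-final substrings of up to max_len characters.

    Hypothetical feature definition, chosen only for this illustration.
    """
    feats = set()
    for n in range(1, min(max_len, len(word)) + 1):
        feats.add(("prefix", word[:n]))
        feats.add(("suffix", word[-n:]))
    return feats


def select_by_coverage(pool, budget, max_len=4):
    """Greedily pick up to `budget` words for annotation.

    Assumption: a substring's weight is the number of distinct pool words
    that share it, and a word's score is the total weight of its substrings
    not yet covered by previously selected words.
    """
    candidates = sorted(set(pool))  # sorted for deterministic tie-breaking
    weights = Counter(
        f for w in candidates for f in edge_substrings(w, max_len)
    )
    covered = set()
    selected = []
    for _ in range(min(budget, len(candidates))):
        best = max(
            candidates,
            key=lambda w: sum(
                weights[f] for f in edge_substrings(w, max_len) - covered
            ),
        )
        selected.append(best)
        covered |= edge_substrings(best, max_len)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    words = ["talossa", "talon", "autossa", "auton", "kissa"]
    print(select_by_coverage(words, budget=2))
```

In an actual active learning loop, the selected words would be sent to an annotator and the segmentation model retrained before the next query round; that retraining step is outside the scope of this sketch.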
