Entry Generation by Analogy – Encoding New Words for Morphological Lexicons

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a lexicon, we need to specify its base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén & al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise, it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e., the first correctly generated entry is on average found as the first or second candidate for the test languages. A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.

[1]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[2]  François Yvon,et al.  An Analogical Learner for Morphological Analysis , 2005, CoNLL.

[3]  Robert R. Hoffman,et al.  Monster Analogies , 1995, AI Mag.

[4]  Krister Lindén,et al.  Guessers for Finite-State Transducer Lexicons , 2009, CICLing.

[5]  Andrei Mikheev Unsupervised Learning of Word-Category Guessing Rules , 1996, ACL.

[6]  Timothy Baldwin,et al.  Bootstrapping Deep Lexical Resources: Resources for Courses , 2005, ACL 2005.

[7]  François Yvon,et al.  Formal Models of Analogical Proportions , 2007 .

[8]  Ebru Arisoy,et al.  Morph-based speech recognition and modeling of out-of-vocabulary words across languages , 2007, TSLP.

[9]  Walter Daelemans,et al.  TiMBL: Tilburg Memory-Based Learner , 2007 .

[10]  Krister Lindén,et al.  Corpus-based Paradigm Selection for Morphological Entries , 2009, NODALIDA.

[11]  David Yarowsky,et al.  Modeling and learning multilingual inflectional morphology in a minimally supervised framework , 2003 .

[12]  Krister Lindén,et al.  Multilingual modeling of cross-lingual spelling variants , 2006, Information Retrieval.

[13]  E. Mark Gold,et al.  Language Identification in the Limit , 1967, Inf. Control.

[14]  Yves Lepage,et al.  Analogy and Formal Languages , 2004, FGMOL.

[15]  John Goldsmith,et al.  Segmentation and morphology , 2010 .

[16]  Markus Forsberg,et al.  Morphological Lexicon Extraction from Raw Text Data , 2006, FinTAL.

[17]  M. McShane,et al.  Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning , 2001, Computational Linguistics.

[18]  Richard Wicentowski Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model , 2004, SIGMORPHON@ACL.

[19]  James Jay Horning,et al.  A study of grammatical inference , 1969 .

[20]  Lauri Carlson  Inducing a Morphological Transducer from Inflectional Paradigms , 2005 .

[21]  Royal Skousen,et al.  Analogical Modeling Of Language , 1989 .

[22]  Andrei Mikheev,et al.  Automatic Rule Induction for Unknown-Word Guessing , 1997, CL.

[23]  Yves Lepage,et al.  Purest ever example-based machine translation: Detailed presentation and assessment , 2005, Machine Translation.

[24]  D. Gentner,et al.  Analogical Learning in Negotiation Teams: Comparing Cases Promotes Learning and Transfer , 2003 .

[25]  Peter D. Turney A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations , 2008, COLING.

[26]  Emmanuel Keuleers,et al.  Dutch plural inflection: The exception that proves the analogy , 2007, Cognitive Psychology.

[27]  Marie-Claude L’Homme  Structuring Terminology using Analogy-Based Machine Learning , 2005 .

[28]  Jacques Sakarovitch,et al.  Introducing VAUCANSON , 2004, Theor. Comput. Sci..

[29]  Petra Barg,et al.  Processing Unknown Words in HPSG , 1998, ACL.

[30]  Brigham Young Paradigm Uniformity and Analogy: The Capitalistic versus Militaristic Debate , 2007 .

[31]  Krister Lindén Assigning an Inflectional Paradigm using the Longest Matching Affix , 2008 .

[32]  Vincent Claveau,et al.  Automatic Morphological Query Expansion Using Analogy-Based Machine Learning , 2007, ECIR.

[33]  David Yarowsky,et al.  Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora , 2001, HLT.

[34]  Dedre Gentner,et al.  Analogical Encoding: Facilitating Knowledge Transfer and Integration , 2004 .

[35]  Marco Baroni,et al.  Unsupervised discovery of morphologically related words based on orthographic and semantic similarity , 2002, SIGMORPHON.

[36]  Yves Lepage,et al.  Solving Analogies on Words: An Algorithm , 1998, COLING-ACL.

[37]  Yves Lepage Languages Of Analogical Strings , 2000, COLING.

[38]  Krister Lindén,et al.  A Probabilistic Model for Guessing Base Forms of New Words by Analogy , 2008, CICLing.

[39]  Johan Schalkwyk,et al.  OpenFst: A General and Efficient Weighted Finite-State Transducer Library , 2007, CIAA.

[40]  Kenneth J. Kurtz,et al.  Converging on a new role for analogy in problem solving and retrieval: when two problems are better than one , 2007, Memory & cognition.

[41]  Tommi A. Pirinen,et al.  HFST Tools for Morphology - An Efficient Open-Source Package for Construction of Morphological Analyzers , 2009, SFCM.

[42]  Petra Barg,et al.  Incremental Identification of Inflectional Types , 2000, COLING.

[43]  Ann A. Copestake,et al.  The ACQUILEX LKB: representation issues in semi-automatic acquisition of large lexicons , 1992, ANLP.

[44]  David Yarowsky,et al.  Minimally Supervised Morphological Analysis by Multimodal Alignment , 2000, ACL.

[45]  Royal Skousen Analogical Modeling: Exemplars, Rules, and Quantum Computing , 2003 .

[46]  Krister Lindén,et al.  Corpus-Based Lexeme Ranking for Morphological Guessers , 2009, SFCM.

[47]  Mikko Kurimo,et al.  Overview of Morpho Challenge in CLEF 2007 , 2007, CLEF.

[48]  John Goldsmith,et al.  Morphological Analogy: Only a Beginning , 2008 .

[49]  Christopher D. Manning,et al.  DEMOS , 2009 .

[50]  Mathias Creutz,et al.  Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-Inflecting and Compounding Languages , 2005 .