Entry Generation for New Words by Analogy for Morphological Lexicons

Language software applications regularly encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a lexicon, we need to indicate its base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén et al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise, it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e., the first correctly generated entry is on average found as the first or second candidate for the test languages. A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.
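The idea of combining corpus-based and lexicon-based probabilities to rank entry candidates can be illustrated with a minimal sketch. The abstract does not state the combination formula, so the sketch assumes a simple product of the two probabilities (a sum of log-probabilities) with a small floor for candidates unseen in the corpus. The function names, the paradigm labels and the probability values are all hypothetical and are not taken from the article or from HFST.

```python
import math


def rank_candidates(lexicon_probs, corpus_probs, floor=1e-6):
    """Rank (base form, paradigm) candidates by a combined log score.

    Assumption: the two sources are combined as a product of probabilities;
    candidates missing from the corpus model receive the floor value.
    """
    scored = []
    for entry, p_lex in lexicon_probs.items():
        p_corp = corpus_probs.get(entry, floor)
        score = math.log(max(p_lex, floor)) + math.log(max(p_corp, floor))
        scored.append((score, entry))
    # Highest combined score first, so the best suggestions lead the list.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


def first_correct_rank(ranked, gold):
    """Return the 1-based rank of the correct entry, or None if it is absent."""
    return ranked.index(gold) + 1 if gold in ranked else None


# Hypothetical candidates for the new English word form "blogs".
lexicon_probs = {
    ("blog", "noun_s"): 0.5,
    ("blogs", "noun_invariant"): 0.4,
    ("blogs", "verb"): 0.1,
}
corpus_probs = {
    ("blog", "noun_s"): 0.7,
    ("blogs", "noun_invariant"): 0.2,
}

ranked = rank_candidates(lexicon_probs, corpus_probs)
print(ranked)
print(first_correct_rank(ranked, ("blog", "noun_s")))
```

Averaging the value returned by `first_correct_rank` over a test set gives the kind of figure the abstract reports, namely how far down the suggestion list a lexicographer has to read before reaching the first correct entry.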

[1] Krister Lindén, et al. Multilingual modeling of cross-lingual spelling variants, 2006, Information Retrieval.

[2] M. McShane, et al. Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning, 2001, Computational Linguistics.

[3] Yves Lepage, et al. Purest ever example-based machine translation: Detailed presentation and assessment, 2005, Machine Translation.

[4] Krister Lindén, et al. Guessers for Finite-State Transducer Lexicons, 2009, CICLing.

[5] David Eddington, et al. Paradigm Uniformity and Analogy: The Capitalistic versus Militaristic Debate, 2009.

[6] Kenneth J. Kurtz, et al. Converging on a new role for analogy in problem solving and retrieval: when two problems are better than one, 2007, Memory & cognition.

[7] Walter Daelemans, et al. TiMBL: Tilburg Memory-Based Learner, 2007.

[8] François Yvon, et al. Formal Models of Analogical Proportions, 2007.

[9] Peter D. Turney. A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations, 2008, COLING.

[10] Vincent Claveau, et al. Automatic Morphological Query Expansion Using Analogy-Based Machine Learning, 2007, ECIR.

[11] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[12] Petra Barg, et al. Incremental Identification of Inflectional Types, 2000, COLING.

[13] Tommi A. Pirinen, et al. HFST Tools for Morphology - An Efficient Open-Source Package for Construction of Morphological Analyzers, 2009, SFCM.

[14] Krister Lindén, et al. Corpus-Based Lexeme Ranking for Morphological Guessers, 2009, SFCM.

[15] Dedre Gentner, et al. Analogical Encoding: Facilitating Knowledge Transfer and Integration, 2004.

[16] François Yvon, et al. An Analogical Learner for Morphological Analysis, 2005, CoNLL.

[17] Andrei Mikheev. Unsupervised Learning of Word-Category Guessing Rules, 1996, ACL.

[18] Markus Forsberg, et al. Morphological Lexicon Extraction from Raw Text Data, 2006, FinTAL.

[19] Ebru Arisoy, et al. Morph-based speech recognition and modeling of out-of-vocabulary words across languages, 2007, TSLP.

[20] David Yarowsky, et al. Minimally Supervised Morphological Analysis by Multimodal Alignment, 2000, ACL.

[21] Jacques Sakarovitch, et al. Introducing VAUCANSON, 2004, Theor. Comput. Sci.

[22] Johan Schalkwyk, et al. OpenFst: A General and Efficient Weighted Finite-State Transducer Library, 2007, CIAA.

[23] Krister Lindén, et al. A Probabilistic Model for Guessing Base Forms of New Words by Analogy, 2008, CICLing.

[24] James Jay Horning, et al. A study of grammatical inference, 1969.

[25] Kimmo Koskenniemi, et al. A General Computational Model for Word-Form Recognition and Production, 1984, ACL.

[26] Andrei Mikheev, et al. Automatic Rule Induction for Unknown-Word Guessing, 1997, CL.

[27] John Goldsmith, et al. Segmentation and morphology, 2010.

[28] Yves Lepage, et al. Solving Analogies on Words: An Algorithm, 1998, COLING-ACL.

[29] Yves Lepage, et al. Analogy and Formal Languages, 2004, FGMOL.

[30] Timothy Baldwin, et al. Bootstrapping Deep Lexical Resources: Resources for Courses, 2005, ACL 2005.

[31] Marco Baroni, et al. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity, 2002, SIGMORPHON.

[32] Krister Lindén, et al. Corpus-based Paradigm Selection for Morphological Entries, 2009, NODALIDA.

[33] Royal Skousen, et al. Analogical Modeling Of Language, 1989.

[34] Kimmo Koskenniemi, et al. A General Computational Model for Word-Form Recognition and Production, 1984.

[35] David Yarowsky, et al. Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora, 2001, HLT.

[36] Emmanuel Keuleers, et al. Dutch plural inflection: The exception that proves the analogy, 2007, Cognitive Psychology.

[37] D. Gentner, et al. Analogical Learning in Negotiation Teams: Comparing Cases Promotes Learning and Transfer, 2003.

[38] Royal Skousen. Analogical Modeling: Exemplars, Rules, and Quantum Computing, 2003.

[39] John Goldsmith, et al. Morphological Analogy: Only a Beginning, 2008.

[40] Krister Lindén. Assigning an Inflectional Paradigm using the Longest Matching Affix, 2008.

[41] Mikko Kurimo, et al. Overview of Morpho Challenge in CLEF 2007, 2007, CLEF.

[42] Richard Wicentowski. Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model, 2004, SIGMORPHON@ACL.

[43] David Yarowsky, et al. Modeling and learning multilingual inflectional morphology in a minimally supervised framework, 2003.

[44] Ann A. Copestake, et al. The ACQUILEX LKB: representation issues in semi-automatic acquisition of large lexicons, 1992, ANLP.

[45] E. Mark Gold, et al. Language Identification in the Limit, 1967, Inf. Control.

[46] Yves Lepage. Languages Of Analogical Strings, 2000, COLING.

[47] Mathias Creutz, et al. Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-Inflecting and Compounding Languages, 2005.

[48] Petra Barg, et al. Processing Unknown Words in HPSG, 1998, ACL.