A Simple Approach to Unknown Word Processing in Japanese Morphological Analysis

This paper presents a simple but effective approach to unknown word processing in Japanese morphological analysis. The approach handles two types of unknown words: 1) unknown words that are derived from words in a pre-defined lexicon and 2) unknown onomatopoeias. It leverages derivation rules and onomatopoeia patterns to correctly recognize these types of unknown words. In experiments, our approach recognized about 4,500 unknown words in 100,000 Web sentences, at the cost of only 80 harmful side effects and a 6% loss in speed.
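To make the two mechanisms concrete, the following is a minimal sketch rather than the method evaluated in the paper. It assumes one illustrative onomatopoeia pattern (a two-kana unit repeated twice, "ABAB") and one illustrative derivation rule (deleting inserted long-sound symbols to recover a lexicon entry). The names `LEXICON`, `is_abab_onomatopoeia`, and `normalize_derived`, as well as the choice of pattern and rule, are hypothetical and are not taken from the abstract.

```python
import re

# Hypothetical toy lexicon; a real analyzer would consult its full dictionary.
LEXICON = {"すごい", "かわいい"}

# Assumed pattern: a two-kana unit (hiragana or katakana) repeated twice.
ABAB = re.compile(r"^([ぁ-んァ-ヴ]{2})\1$")


def is_abab_onomatopoeia(word: str) -> bool:
    """Return True if the word has the form ABAB with A != B."""
    m = ABAB.match(word)
    return m is not None and m.group(1)[0] != m.group(1)[1]


def normalize_derived(word: str):
    """Assumed rule: delete long-sound symbols and look the result up.

    Returns the matching lexicon entry, or None if the word contains no
    long-sound symbol or the stripped form is not in the lexicon.
    """
    candidate = word.replace("ー", "").replace("〜", "")
    if candidate != word and candidate in LEXICON:
        return candidate
    return None


if __name__ == "__main__":
    for w in ["ぺらぺら", "ペラペラ", "すごーい", "すごい"]:
        print(w, is_abab_onomatopoeia(w), normalize_derived(w))
```

In this sketch a word is only mapped back to a base form when the stripped candidate is in the lexicon. This is one simple way to limit spurious matches, in the spirit of the low number of harmful side effects reported above.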
