Language Independent Morphological Analysis

This paper proposes a framework for language-independent morphological analysis and concentrates mainly on tokenization, the first step of morphological analysis. Although tokenization is usually not regarded as a difficult task in most segmented languages such as English, a number of problems arise in treating lexical entries precisely. We first introduce the concept of morpho-fragments, which are intermediate units between characters and lexical entries. We then describe our approach to resolving the problems that arise in tokenization, so as to attain a language-independent morphological analyzer.
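To make the idea of units that sit between characters and lexical entries concrete, the following is a minimal sketch, not the paper's actual method. The fragment rule (a run of word characters, a run of whitespace, or a single other symbol), the function names `split_fragments` and `lexical_candidates`, the `max_span` limit, and the toy lexicon are all assumptions made for illustration. It shows how lexical entries that contain a space or an apostrophe could be recovered as runs of adjacent fragments.

```python
import re

def split_fragments(text):
    # Assumed fragment rule (hypothetical, not taken from the paper):
    # a run of word characters, a run of whitespace, or one other symbol.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def lexical_candidates(frags, lexicon, max_span=5):
    """Return (start, end, entry) for each run frags[start:end] found in the lexicon."""
    found = []
    for start in range(len(frags)):
        last = min(start + max_span, len(frags))
        for end in range(start + 1, last + 1):
            entry = "".join(frags[start:end])
            if entry in lexicon:
                found.append((start, end, entry))
    return found

# Toy lexicon (hypothetical) with entries that span several fragments.
lexicon = {"New York", "New", "York", "isn't", "big"}
frags = split_fragments("New York isn't big")
candidates = lexical_candidates(frags, lexicon)
for start, end, entry in candidates:
    print(start, end, repr(entry))
```

In this sketch the candidates overlap ("New", "York", and "New York" are all reported), so a later disambiguation step would still have to choose among them.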
