Automated Extraction of Lexicon Applied both to Chinese and Japanese Corpora

We describe a novel statistical approach for automatically extracting large word lists from unsegmented corpora without relying on existing dictionaries. The approach makes two main contributions. First, it is generic and has been applied successfully to Chinese and Japanese separately. Second, it makes no use of punctuation information, so, unlike most existing methods, it does not require pre-processing the corpora to remove punctuation or pre-segmenting them at punctuation marks. Our experiments extracted 14,087 Chinese words and 15,553 Japanese words. Precision is over 80% for two-character Chinese words, over 90% for one-character Japanese words, and over 70% for two-character Japanese words. We also extracted most single-character words, including common Chinese function characters (glossed as "in", "and", "or", "'s", and "also"), a Chinese family name, hiragana in Japanese, and punctuation marks such as "," and "?".
