An evolutionary approach to automatic Chinese text segmentation

Textual information written in Chinese now represents a huge knowledge repository. The first step in managing and processing written Chinese text is segmentation. This paper outlines a new method for automatic Chinese text segmentation that uses evolutionary algorithms and statistical data from Web searches. The proposed method treats Web text as a de facto corpus that updates automatically, thus eliminating the need for statistical training. It treats segmentation as a search for the most probable way in which individual characters combine into sentences, paragraphs, and articles, and so produces segmentation results that are tailored to the text in question and independent of segmentation standards.
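
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, under stated assumptions: a candidate segmentation is encoded as a vector of boundary bits, and a simple genetic algorithm searches for the vector with the highest score. The table `TOY_HITS`, the constant `TOTAL_HITS`, the smoothed unigram-style `fitness` function, and the operator choices (tournament selection, one-point crossover, bit-flip mutation, elitism) are all hypothetical stand-ins, not the authors' method; in particular, the toy counts merely take the place of real Web search statistics.

```python
import math
import random

# Hypothetical hit counts standing in for Web search statistics (toy numbers).
TOY_HITS = {
    "研究": 900, "研究生": 300, "生命": 800, "起源": 600,
    "研": 50, "究": 40, "生": 200, "命": 100, "起": 150, "源": 60,
}
TOTAL_HITS = 10_000  # assumed normalising constant, not a real corpus size


def decode(text, bits):
    """Turn boundary bits into segments; bits[i] == 1 means cut after text[i]."""
    segments, start = [], 0
    for i, bit in enumerate(bits):
        if bit:
            segments.append(text[start:i + 1])
            start = i + 1
    segments.append(text[start:])
    return segments


def fitness(text, bits, hits=TOY_HITS, total=TOTAL_HITS):
    """Sum of log relative frequencies; unseen segments get a smoothed count of 0.5."""
    return sum(math.log((hits.get(seg, 0) + 0.5) / total)
               for seg in decode(text, bits))


def evolve(text, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Evolve boundary vectors and return the best segmentation found."""
    n = len(text) - 1  # number of possible boundary positions
    if n < 1:
        return [text]
    rng = random.Random(seed)

    def score(bits):
        return fitness(text, bits)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = [max(population, key=score)[:]]  # elitism: keep the current best
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(population, 3), key=score)
            b = max(rng.sample(population, 3), key=score)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    return decode(text, max(population, key=score))


if __name__ == "__main__":
    print(evolve("研究生命起源"))
```

Under these toy counts, the segmentation 研究 / 生命 / 起源 scores higher than 研究生 / 命 / 起源, which illustrates how frequency statistics can resolve an overlapping ambiguity without a hand-built dictionary. A real system would replace `TOY_HITS` with counts obtained from a search engine and would likely need a more careful scoring function.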
