Applying Machine Learning to Text Segmentation for Information Retrieval

We propose a self-supervised word segmentation technique for Chinese information retrieval. The method combines the advantages of traditional dictionary-based, character-based, and mutual-information-based approaches while overcoming many of their shortcomings. Experiments on TREC data show that the method is promising. It is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although self-supervised segmentation is less accurate than some other segmentation methods, its accuracy is sufficient to give good retrieval performance. It is commonly believed that retrieval performance in Chinese information retrieval improves monotonically with word segmentation accuracy. However, we find that the relationship between segmentation accuracy and retrieval performance is in fact nonmonotonic: at around 70% word segmentation accuracy, an over-segmentation phenomenon begins to occur that reduces retrieval performance. We demonstrate this effect with an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with accuracies ranging from 44% to 95%, including the 70% accuracy achieved by our self-supervised approach. The main reason for the drop in retrieval performance appears to be that accurate segmenters preserve correct compounds and collocations, whereas less accurate (but reasonable) segmenters break them up, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that word segmentation should be evaluated by different standards for different applications.
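
To make the notion of word segmentation accuracy concrete, the following minimal sketch scores a predicted segmentation against a reference one by comparing word spans. The abstract does not state which accuracy measure the experiments use, so the precision, recall, and F1 definitions here are an assumption, and the function names and the example sentence are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(reference_words, predicted_words):
    """Return (precision, recall, f1) over exactly matching word spans.

    Assumed metric for illustration only; not taken from the paper.
    """
    if "".join(reference_words) != "".join(predicted_words):
        raise ValueError("both segmentations must cover the same text")
    reference = word_spans(reference_words)
    predicted = word_spans(predicted_words)
    correct = len(reference & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the reference keeps the compound "机器学习"
# (machine learning) whole, while the prediction splits it in two.
reference = ["机器学习", "方法"]
predicted = ["机器", "学习", "方法"]
precision, recall, f1 = segmentation_scores(reference, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the prediction is penalized for splitting the compound, even though both halves remain usable index terms. That is the kind of mismatch between segmentation accuracy and retrieval usefulness that the abstract describes.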
