Semantic feature reduction in Chinese document clustering

Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms are derived from clustering methods for traditional structured data; they rely heavily on term analysis and adopt the vector space model (VSM) as their document representation. Because text is inherently represented in a high-dimensional feature space, sparseness has a strong impact on these clustering algorithms. Feature reduction is therefore an important preprocessing step: by removing redundant and irrelevant terms from the corpus, it improves the efficiency and accuracy of clustering. Although clustering is considered an unsupervised learning method, text still offers prior knowledge that can be obtained through NLP-based analysis. In this paper, we propose a feature reduction method based on semantic analysis for Chinese text clustering. Our method is based on a dedicated selection of part-of-speech (POS) tags and on synonym consolidation. It reduces the feature space of documents more effectively than the traditional feature reduction methods of TF-IDF and stopword removal, while preserving and sometimes even improving clustering accuracy. In our experiments, we tested the method with the bisecting k-means algorithm, which has been shown to be efficient for text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
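
The two reduction steps named in the abstract, POS tag selection and synonym consolidation, together with the purity measure, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the retained tag set `KEEP_POS`, the toy synonym table `SYNONYMS`, and the function names are hypothetical, and the abstract does not state which tags or which synonym resource the method actually uses. The input is assumed to be already segmented and POS-tagged, and purity follows the common definition (the share of documents belonging to the majority class of their cluster).

```python
from collections import Counter

# Hypothetical choices for illustration only; the abstract does not give
# the actual tag set or synonym resource.
KEEP_POS = {"n", "v"}  # keep nouns and verbs
SYNONYMS = {"电脑": "计算机", "微机": "计算机"}  # variant -> canonical term


def reduce_features(tagged_tokens, keep_pos=KEEP_POS, synonyms=SYNONYMS):
    """Keep tokens with a selected POS tag, then merge synonyms into one term."""
    kept = [word for word, pos in tagged_tokens if pos in keep_pos]
    return Counter(synonyms.get(word, word) for word in kept)


def purity(cluster_labels, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(class_labels)


# A pre-segmented, pre-tagged toy sentence: (word, POS tag) pairs.
doc = [("我", "r"), ("的", "u"), ("电脑", "n"), ("和", "c"),
       ("微机", "n"), ("运行", "v"), ("很", "d"), ("快", "a")]
features = reduce_features(doc)

# Toy clustering result: two clusters over six documents with known classes.
score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

In this toy input, eight tokens collapse to two features, because function words are dropped by the tag filter and the two words for "computer" are merged into one term. The resulting term counts could then be weighted and passed to any clustering algorithm, such as the bisecting k-means used in the experiments.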
