A New Semantic-Based Hierarchical Clustering For Textual Data Algorithm

Suffix tree clustering (STC) is a fast, incremental clustering method that represents data as a tree. However, deriving a hierarchy of clusters directly from the hierarchy of the constructed suffix tree is not reasonable: STC ignores the meaning of words and returns some large clusters of poor quality. In this paper, a new semantic-based hierarchical clustering algorithm for textual data is proposed. The proposed algorithm uses a semantic suffix net as the structure to represent data and net pruning techniques as the logic to combine related suffixes through their suffix links. Consequently, the hierarchy of clusters emerges directly, because the result of the pruning techniques is itself a tree. Textual information can thus be grouped using both string matching and the meaning of words, and the hierarchy of the resulting clusters can be returned directly. As a result, the quality of the clusters increases, and the merge and split problems that hierarchical clustering suffers from are reduced. The proposed algorithm can therefore be used to mitigate these problems and enhance cluster quality.
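
To illustrate why grouping by string matching alone can miss related documents, the following is a minimal sketch of STC-style base-cluster discovery, not the algorithm proposed in the paper. It assumes a plain dictionary of word-level phrases in place of a real suffix tree or semantic suffix net, and a hypothetical hand-written synonym table (`SYNONYMS`) in place of a lexical resource; the names `normalize` and `base_clusters` and the parameters `max_len` and `min_docs` are illustrative only.

```python
from collections import defaultdict

# Hypothetical synonym table standing in for a lexical resource (assumption, not from the paper).
SYNONYMS = {"automobile": "car", "auto": "car", "fast": "quick"}


def normalize(word, synonyms):
    """Lowercase a word and map it to a canonical synonym if one is known."""
    w = word.lower()
    return synonyms.get(w, w)


def base_clusters(docs, synonyms=None, max_len=3, min_docs=2):
    """Map each word-level phrase shared by >= min_docs documents to their ids."""
    synonyms = synonyms or {}
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = [normalize(w, synonyms) for w in text.split()]
        for start in range(len(words)):
            # Every prefix of a suffix corresponds to a node on a suffix-tree path.
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


docs = ["fast car sale", "quick automobile sale", "cat ate cheese"]
plain = base_clusters(docs)               # string matching only
semantic = base_clusters(docs, SYNONYMS)  # string matching plus word meaning
```

With string matching only, the first two documents share just the phrase `("sale",)`. Once synonyms are normalized, they also share the longer phrase `("quick", "car", "sale")`, which suggests how taking word meaning into account can produce tighter clusters.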
