Improving Suffix Tree Clustering Algorithm for Web Documents

Web document clustering can help users quickly locate the information they need among the results returned by search engines. Motivated by the characteristics of the suffix tree structure and by flaws in the similarity calculation that the STC algorithm uses for cluster merging, this paper proposes an improved suffix tree clustering method. The method combines the vector space model with the Pearson correlation coefficient: it calculates the relevance of clusters based on the document vectors of all clusters, and then uses the clusters' relevance vectors and the correlations between them to calculate the similarity for cluster merging, which improves the document clustering process. Analysis of the experimental results shows that the method outperforms the original STC algorithm on Web document clustering.
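
The merging idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact formulation: the abstract does not give the formulas, so the sketch assumes each base cluster is represented by a binary document-membership vector, that two clusters are merged when the Pearson correlation of their vectors exceeds a threshold, and that 0.5 is a reasonable default threshold. The function names (`cluster_vector`, `pearson`, `merge_clusters`) are hypothetical.

```python
from math import sqrt


def cluster_vector(cluster, doc_ids):
    """Binary vector: 1.0 if the document belongs to the cluster (assumed representation)."""
    return [1.0 if d in cluster else 0.0 for d in doc_ids]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / sqrt(var_x * var_y)


def merge_clusters(clusters, doc_ids, threshold=0.5):
    """Merge base clusters whose vectors correlate above `threshold` (assumed value)."""
    parent = list(range(len(clusters)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    vectors = [cluster_vector(c, doc_ids) for c in clusters]
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if pearson(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)

    # Collect the documents of every connected group of base clusters.
    merged = {}
    for i, cluster in enumerate(clusters):
        merged.setdefault(find(i), set()).update(cluster)
    return list(merged.values())


if __name__ == "__main__":
    docs = [1, 2, 3, 4, 5, 6]
    base_clusters = [{1, 2, 3}, {1, 2, 3, 4}, {5, 6}]
    print(merge_clusters(base_clusters, docs))
```

In this toy input the first two base clusters share most of their documents and end up in one merged cluster, while the third stays separate. The original STC algorithm instead merges base clusters using a document-overlap ratio, which is the similarity calculation the abstract identifies as flawed.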
