Improving Suffix Tree Clustering with New Ranking and Similarity Measures

Retrieving relevant information from the web, which contains an enormous amount of data, is a challenging research problem. An important contribution to this area is web clustering, which organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters automatically and accurately. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because it uses phrases efficiently to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms both the original STC and ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is attributed to better filtering of low-scoring clusters, better similarity measures, and efficient cluster merging algorithms.
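To make the baseline concrete, the following is a minimal sketch of the three stages of the original STC that the abstract refers to: forming base clusters from shared phrases, scoring them, and merging overlapping ones. It is not the STHAC method, whose ranking and similarity measures are not specified here. Several things are assumptions made for illustration: word n-grams stand in for a real suffix tree, the function names (`base_clusters`, `score`, `merge`) and parameters (`max_len`, `min_docs`, `threshold`) are hypothetical, the phrase-length weight in `score` is a simplified stand-in, and the 0.5 mutual-overlap threshold follows the commonly described STC merge rule.

```python
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each phrase (word n-gram) to the set of document ids containing it.

    Assumption: n-grams up to max_len words replace a real suffix tree.
    """
    index = {}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index.setdefault(tuple(words[i:i + n]), set()).add(doc_id)
    # A base cluster must be shared by at least min_docs documents.
    return {p: d for p, d in index.items() if len(d) >= min_docs}


def score(phrase, doc_ids):
    """Score = |documents| * f(|phrase|).

    Assumption: f penalizes single words (0.5), grows linearly with
    phrase length, and is capped at 6 words.
    """
    n = len(phrase)
    f = 0.5 if n == 1 else min(n, 6)
    return len(doc_ids) * f


def merge(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap by more than
    `threshold` in both directions; return connected components."""
    items = list(clusters.items())
    parent = list(range(len(items)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(items)), 2):
        a, b = items[i][1], items[j][1]
        common = len(a & b)
        if common / len(a) > threshold and common / len(b) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, (_, doc_ids) in enumerate(items):
        groups.setdefault(find(i), set()).update(doc_ids)
    return sorted(sorted(g) for g in groups.values())


docs = [
    "suffix tree clustering",
    "suffix tree construction",
    "web search results",
    "web search engines",
]
print(merge(base_clusters(docs)))  # [[0, 1], [2, 3]]
```

In this sketch merging is single-link over a fixed overlap threshold. An approach such as STHAC would replace that step with hierarchical agglomerative merging under its own similarity measures.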

[1]  Dekang Lin,et al.  Automatic Retrieval and Clustering of Similar Words , 1998, ACL.

[2]  Philip S. Yu,et al.  Data Mining: An Overview from a Database Perspective , 1996, IEEE Trans. Knowl. Data Eng..

[3]  Brett Kessler,et al.  Computational dialectology in Irish Gaelic , 1995, EACL.

[4]  Jiangning Wu,et al.  Search Results Clustering in Chinese Context Based on a New Suffix Tree , 2008, 2008 IEEE 8th International Conference on Computer and Information Technology Workshops.

[5]  Xiaotie Deng,et al.  A new suffix tree similarity measure for document clustering , 2007, WWW '07.

[6]  Arne Andersson,et al.  Efficient implementation of suffix trees , 1995, Softw. Pract. Exp..

[7]  Robert M. Losee.  When information retrieval measures agree about the relative quality of document rankings , 2000 .

[8]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[9]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[10]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[11]  Jianhua Wang,et al.  A New Cluster Merging Algorithm of Suffix Tree Clustering , 2007, Enterprise Information Systems and Web Technologies.

[12]  Dawid Weiss,et al.  A survey of Web clustering engines , 2009, CSUR.

[13]  Xiaotie Deng,et al.  Efficient Phrase-Based Document Similarity for Clustering , 2008, IEEE Transactions on Knowledge and Data Engineering.

[14]  Constantine Stephanidis.  Intelligent and ubiquitous interaction environments , 2009 .

[15]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[16]  Ujwala Bharambe,et al.  A New Suffix Tree Similarity Measure and Labeling for Web Search Results Clustering , 2009, 2009 Second International Conference on Emerging Trends in Engineering & Technology.

[17]  Dell Zhang,et al.  Semantic, Hierarchical, Online Clustering of Web Search Results , 2004, APWeb.

[18]  Bjørn Kjos-Hanssen,et al.  Google distance between words , 2009, ArXiv.

[19]  Oren Etzioni,et al.  Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[20]  G. W. Milligan,et al.  A Two-Stage Clustering Algorithm with Robust Recovery Characteristics , 1980 .

[21]  Mohamed S. Kamel,et al.  Phrase-based document similarity based on an index graph model , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[22]  Peter Willett,et al.  Recent trends in hierarchic document clustering: A critical review , 1988, Inf. Process. Manag..

[23]  R. Mooney,et al.  Impact of Similarity Measures on Web-page Clustering , 2000 .

[24]  Felix Naumann,et al.  Data fusion , 2009, CSUR.

[25]  Yanchun Zhang,et al.  Advanced Web Technologies and Applications , 2004, Lecture Notes in Computer Science.

[26]  Arne Andersson,et al.  Suffix Trees on Words , 1996, CPM.

[27]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[28]  Dawid Weiss,et al.  A concept-driven algorithm for clustering search results , 2005, IEEE Intelligent Systems.

[29]  Bruno Martins,et al.  Universal Mobile Information Retrieval , 2009, HCI.

[30]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[31]  Anna-Lan Huang,et al.  Similarity Measures for Text Document Clustering , 2008 .

[32]  Sanjay Ranka,et al.  An efficient k-means clustering algorithm , 1997 .

[33]  Xiaoying Gao,et al.  Query Directed Web Page Clustering , 2006, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06).

[34]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[35]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[36]  Mohamed S. Kamel,et al.  Efficient phrase-based document indexing for Web document clustering , 2004, IEEE Transactions on Knowledge and Data Engineering.

[37]  Martin Ester,et al.  Frequent term-based text clustering , 2002, KDD.

[38]  Constantine Stephanidis,et al.  Universal Access in Human-Computer Interaction , 2011 .

[39]  Xiaoying Gao,et al.  Improving Web clustering by cluster selection , 2005, The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05).