Fuzzy co-clustering of Web documents

The Web is the largest information repository in the history of mankind. Because of its huge size, however, finding relevant information without an appropriate tool can be virtually impossible. Web document clustering is one technique for improving the efficiency of the information-finding process. In this paper, we investigate fuzzy co-clustering, which is known to be robust for clustering standard text documents. We believe its robustness also extends to Web documents, because it can generate descriptive clusters in high-dimensional data and can discover overlapping clusters. We consider two existing fuzzy co-clustering algorithms, FCCM and fuzzy Codok. In addition, we propose a new algorithm, FCC-STF, as an alternative to the existing ones. We present an empirical study of these algorithms on benchmark datasets, together with a performance comparison against a standard fuzzy clustering algorithm, HFCM. The results show that fuzzy co-clustering is generally superior to standard fuzzy clustering in the Web environment, making it a technique with great potential to help Internet users discover relevant information effectively.
