Hard and fuzzy diagonal co-clustering for document-term partitioning

We propose hard and fuzzy diagonal co-clustering algorithms, built upon double K-means, to address the problem of document-term co-clustering. At each iteration, the proposed algorithms seek a diagonal block structure in the data by minimizing a criterion based on both the within-class variance and the centroid effect. In addition to being easy to interpret and effective on sparse binary and continuous data, the proposed algorithms, Hard Diagonal Double K-means (DDKM) and Fuzzy Diagonal Double K-means (F-DDKM), are also faster than other state-of-the-art clustering algorithms. We evaluate our contribution on synthetic data sets and on real data sets commonly used in document clustering.
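To make the idea of seeking a diagonal block structure concrete, the following is a minimal sketch of a hard diagonal co-clustering loop on a binary document-term matrix. It is not the DDKM criterion, which the abstract does not spell out: as a stand-in, it counts the cells that disagree with an ideal pattern in which entries are 1 inside the k diagonal blocks and 0 elsewhere. The function names `diagonal_cocluster` and `mismatch`, and the parameters `n_iter` and `seed`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def diagonal_cocluster(X, k, n_iter=20, seed=0):
    """Illustrative hard diagonal co-clustering of a binary matrix X.

    Assumption: the objective is the number of cells that differ from an
    ideal block-diagonal pattern. This is a stand-in, not the paper's
    criterion (within-class variance plus centroid effect).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    col_labels = rng.integers(k, size=n_cols)   # random initial column partition
    row_labels = np.zeros(n_rows, dtype=int)

    for _ in range(n_iter):
        # Row step: the cost of putting row i in cluster c is, up to a
        # constant in c, (size of column cluster c) - 2 * (ones of row i in c).
        W = np.eye(k)[col_labels]               # (n_cols, k) column indicators
        row_cost = W.sum(axis=0) - 2 * X @ W
        new_rows = row_cost.argmin(axis=1)

        # Column step: same rule with the roles of rows and columns swapped.
        Z = np.eye(k)[new_rows]                 # (n_rows, k) row indicators
        col_cost = Z.sum(axis=0) - 2 * X.T @ Z
        new_cols = col_cost.argmin(axis=1)

        # Stop when neither partition changes.
        converged = (np.array_equal(new_rows, row_labels)
                     and np.array_equal(new_cols, col_labels))
        row_labels, col_labels = new_rows, new_cols
        if converged:
            break

    return row_labels, col_labels


def mismatch(X, row_labels, col_labels):
    """Count cells of X that disagree with the ideal block-diagonal pattern."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    ideal = (row_labels[:, None] == col_labels[None, :]).astype(float)
    return int(np.abs(np.asarray(X, dtype=float) - ideal).sum())
```

Because each step reassigns rows (or columns) to the block that minimizes the stand-in objective with the other partition held fixed, the mismatch count cannot increase from one step to the next. Like K-means, the loop can still settle in a local optimum or leave a block empty, so several random restarts would be a reasonable precaution in practice.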
