Parameter-less co-clustering for star-structured heterogeneous data

Data described by multiple feature sets drawn from heterogeneous domains are increasingly common in real-world applications. Such data represent objects of one type connected to several other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data usually requires specifying many parameters, such as the number of clusters for the object dimension and for each feature domain. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less: it requires neither the number of row clusters nor the number of column clusters for the given feature spaces. Our approach optimizes Goodman–Kruskal's τ, a measure of cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and, in particular, apply it in a higher-dimensional setting. We propose the algorithm CoStar, which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for co-clustering heterogeneous data while remaining computationally efficient.
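
To make the measure concrete, the following is a minimal sketch of the standard two-variable Goodman–Kruskal τ computed on a contingency table, together with a helper that aggregates a data matrix into such a table given row and column cluster labels. It is not the CoStar algorithm and does not reproduce the paper's extension of τ to several feature spaces; the function names `goodman_kruskal_tau` and `cocluster_table` and the toy data are assumptions made for illustration only.

```python
from typing import List, Sequence


def cocluster_table(matrix: Sequence[Sequence[float]],
                    row_labels: Sequence[int],
                    col_labels: Sequence[int]) -> List[List[float]]:
    """Sum matrix entries per (row cluster, column cluster) pair.

    Hypothetical helper: labels are assumed to be integers 0..k-1.
    """
    k = max(row_labels) + 1
    l = max(col_labels) + 1
    table = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            table[row_labels[i]][col_labels[j]] += value
    return table


def goodman_kruskal_tau(table: Sequence[Sequence[float]]) -> float:
    """tau_{Y|X}: proportional reduction in the error of predicting the
    column category Y once the row category X is known."""
    total = sum(sum(row) for row in table)
    if total <= 0:
        raise ValueError("contingency table must have a positive total")
    n_cols = len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(n_cols)]

    # Error when predicting Y from its marginal alone: 1 - sum_j p_.j^2
    err_marginal = 1.0 - sum((c / total) ** 2 for c in col_sums)
    if err_marginal == 0.0:
        return 0.0  # Y is constant; returning 0 is a convention chosen here

    # Error when predicting Y given X: 1 - sum_i sum_j p_ij^2 / p_i.
    explained = sum(
        (value / total) ** 2 / (r / total)
        for row, r in zip(table, row_sums) if r > 0
        for value in row
    )
    err_conditional = 1.0 - explained
    return (err_marginal - err_conditional) / err_marginal


if __name__ == "__main__":
    # Toy object-by-feature matrix with two obvious blocks (made-up data).
    data = [[3, 2, 0, 0],
            [4, 1, 0, 0],
            [0, 0, 2, 3],
            [0, 0, 5, 0]]
    table = cocluster_table(data, row_labels=[0, 0, 1, 1], col_labels=[0, 0, 1, 1])
    print(goodman_kruskal_tau(table))
```

τ equals 1 when each row cluster maps to exactly one column cluster, as in the block-diagonal toy example, and 0 when rows and columns are statistically independent. The measure is asymmetric, so swapping the roles of rows and columns (transposing the table) generally gives a different value.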
