Internal Evaluation Measures as Proxies for External Indices in Clustering Gene Expression Data

Several external indices, which use information not present in the dataset, have been shown to be useful for evaluating representative-based clustering algorithms. However, such supervised measures cannot directly guide the construction of better clustering algorithms when class labels are not available. We propose a method for identifying internal cluster evaluation measures that use only information present in the dataset and are related to given external indices. We then use these internal measures to construct representative-based clustering algorithms. Both the identification and the utilization steps of the proposed method are enabled by a component-based approach to clustering algorithm design. Experiments on 432 algorithms using gene expression datasets provide evidence that some internal measures could be used as surrogates for external indices proposed in the literature. Moreover, the results suggest that internal measures correlated with selected external indices can guide the algorithms toward significantly better cluster models.
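The identification step described above can be illustrated with a minimal sketch: score a set of candidate clusterings with one internal measure and one external index, then check how well the two rankings agree. The abstract does not say which measures or which correlation statistic the authors use, so everything concrete here is an assumption made for illustration: the Rand index as the external index, negative within-cluster sum of squares as the internal measure, Spearman rank correlation as the agreement statistic, and a hypothetical six-point toy dataset in place of gene expression data.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """External index: fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def neg_wcss(points, labels):
    """Internal measure: negative within-cluster sum of squared distances
    to the cluster centroids (higher means more compact clusters)."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return -total

def ranks(values):
    """Ascending ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: two well-separated groups with known class labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
truth = [0, 0, 0, 1, 1, 1]

# Hypothetical outputs of four clustering algorithms (all with two clusters).
candidates = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
]

external = [rand_index(truth, c) for c in candidates]  # needs class labels
internal = [neg_wcss(points, c) for c in candidates]   # needs only the data
rho = spearman(internal, external)
print("Spearman correlation between internal and external scores:", rho)
```

A high rank correlation over many algorithms and datasets is the kind of evidence that would make an internal measure a plausible surrogate for the external index. With only four candidates on one toy dataset, the number printed here demonstrates the mechanics and nothing more.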
