OsamorSoft: clustering index for comparison and quality validation in high throughput dataset

Different k-means clustering algorithms can produce different results on the same data, which creates a need for a simplified approach to validating the quality of the clusters obtained. The differences arise partly from the way each algorithm selects its initial seeds or centroids (randomly, sequentially, or by some other principle), a choice that tends to influence the final result. Popular external models for cluster quality validation and comparison require the computation of clustering indexes such as Rand, Jaccard, Fowlkes and Mallows, the Morey and Agresti Adjusted Rand Index (ARI_MA) and the Hubert and Arabie Adjusted Rand Index (ARI_HA). In the literature, ARI_HA has been adjudged a good measure of cluster validity. Based on ARI_HA as a popular clustering quality index, we developed OsamorSoft, which comprises DNA_Omatrix and OsamorSpreadSheet, as a tool for cluster quality validation in high throughput analysis. The proposed method will help to bridge the gap created by the small number of user-friendly tools available for externally evaluating the ever-increasing number of clustering algorithms. Our implementation was tested on clusters created with four k-means algorithms using malaria microarray data. Furthermore, our results yielded a compact 4-stage OsamorSpreadSheet statistic that OsamorSoft, our easy-to-use Java GUI and spreadsheet-based tool, uses for cluster quality comparison. We recommend that a framework be developed to facilitate the simplified integration and automation of several other cluster validity indexes for comparative analysis of big data problems.
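To make the indexes named above concrete, the following is a minimal Python sketch of how Rand, Jaccard, Fowlkes and Mallows, and ARI_HA can be computed from the pair counts of two clusterings of the same items. It is not the OsamorSoft implementation. The function names (`pair_counts`, `rand_index`, `jaccard_index`, `fowlkes_mallows`, `ari_ha`) and the return values chosen for degenerate cases (zero denominators) are assumptions made for illustration; the formulas are the standard pair-counting definitions.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(u, v):
    """Count item pairs by agreement between two labelings u and v.

    a: same cluster in both; b: same in u only; c: same in v only;
    d: different clusters in both.
    """
    if len(u) != len(v):
        raise ValueError("labelings must have the same length")
    n = len(u)
    both = sum(comb(m, 2) for m in Counter(zip(u, v)).values())
    same_u = sum(comb(m, 2) for m in Counter(u).values())
    same_v = sum(comb(m, 2) for m in Counter(v).values())
    a = both
    b = same_u - both
    c = same_v - both
    d = comb(n, 2) - a - b - c
    return a, b, c, d


def rand_index(u, v):
    a, b, c, d = pair_counts(u, v)
    total = a + b + c + d
    # Assumption: with fewer than two items there is nothing to disagree on.
    return (a + d) / total if total else 1.0


def jaccard_index(u, v):
    a, b, c, _ = pair_counts(u, v)
    denom = a + b + c
    # Assumption: return 1.0 when no pair is co-clustered in either labeling.
    return a / denom if denom else 1.0


def fowlkes_mallows(u, v):
    a, b, c, _ = pair_counts(u, v)
    denom = sqrt((a + b) * (a + c))
    # Assumption: return 0.0 when either labeling has no co-clustered pair.
    return a / denom if denom else 0.0


def ari_ha(u, v):
    """Hubert and Arabie Adjusted Rand Index from pair counts."""
    a, b, c, d = pair_counts(u, v)
    total = a + b + c + d
    if total == 0:
        return 1.0  # assumption for the degenerate single-item case
    expected = (a + b) * (a + c) / total      # chance-expected value of a
    max_index = ((a + b) + (a + c)) / 2       # largest attainable value of a
    if max_index == expected:
        return 1.0  # assumption: both labelings are trivial and identical
    return (a - expected) / (max_index - expected)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    candidate = [0, 1, 0, 1]
    print(rand_index(reference, candidate))
    print(jaccard_index(reference, candidate))
    print(fowlkes_mallows(reference, candidate))
    print(ari_ha(reference, candidate))
```

For the two labelings in the usage example, no pair of items is grouped together by both clusterings, so the Rand index is 1/3 while ARI_HA is -0.5. This shows the effect of the chance correction: agreement that is worse than expected at random yields a negative ARI_HA, whereas the unadjusted Rand index stays positive.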
