Clustering of samples and variables with mixed-type data

Analyzing data measured on different scales is a common challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, integrating other features that may be measured on different scales, e.g., clinical or cytogenetic factors, is becoming increasingly important. Commonly, the analysis results (e.g., a selection of relevant genes) are visualized with further information, such as clinical factors, merely added on top. A more integrative approach is desirable, in which all available data are analyzed jointly and the different data sources are also combined more naturally in the visualization. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with a special focus on clustering variables. Clustering of variables has received less attention in the literature than clustering of samples. We extend the methodology for clustering variables with two new approaches, one based on a combination of different association measures and the other on distance correlation. We evaluate and compare different clustering strategies in simulation studies. Methods designed specifically for mixed-type data perform comparably to, and in many cases better than, standard approaches applied to the corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for visualization. Real data examples illustrate various potential applications of the integrative heatmap and of other graphical displays based on dissimilarity matrices.
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
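
To make the idea of a distance-correlation-based dissimilarity between mixed-type variables more concrete, the following is a minimal Python sketch. It is not the CluMix implementation (which is an R package) and does not reproduce its API. The function names, the use of a 0/1 mismatch distance for categorical variables, and the choice of one minus the distance correlation as the dissimilarity are all assumptions made for illustration.

```python
import numpy as np


def _centered_distances(values, categorical=False):
    """Pairwise distance matrix of one variable, double-centered.

    Assumption for this sketch: categorical variables use a 0/1 mismatch
    distance, quantitative variables use the absolute difference.
    """
    v = np.asarray(values)
    if categorical:
        d = (v[:, None] != v[None, :]).astype(float)
    else:
        v = v.astype(float)
        d = np.abs(v[:, None] - v[None, :])
    # Subtract row and column means, add back the grand mean.
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y, x_categorical=False, y_categorical=False):
    """Sample distance correlation between two variables (value in [0, 1])."""
    a = _centered_distances(x, x_categorical)
    b = _centered_distances(y, y_categorical)
    dcov2 = (a * b).mean()                      # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()     # product of squared distance variances
    if dvar2 <= 0:                              # a constant variable carries no information
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def dissimilarity_matrix(variables):
    """Dissimilarity 1 - dCor between all pairs of variables.

    `variables` maps a name to a (values, is_categorical) pair; this input
    layout is hypothetical and chosen only for the example.
    """
    names = list(variables)
    k = len(names)
    d = np.zeros((k, k))
    for i in range(k):
        xi, ci = variables[names[i]]
        for j in range(i + 1, k):
            xj, cj = variables[names[j]]
            d[i, j] = d[j, i] = 1.0 - distance_correlation(xi, xj, ci, cj)
    return names, d


# Toy data: two quantitative variables, one categorical, one unrelated.
rng = np.random.default_rng(0)
expression = rng.normal(size=50)
variables = {
    "expression": (expression, False),
    "rescaled": (2.0 * expression + 1.0, False),            # linear function of expression
    "subtype": (np.where(expression > 0, "A", "B"), True),  # categorical, derived from expression
    "noise": (rng.normal(size=50), False),
}
names, D = dissimilarity_matrix(variables)
```

A matrix such as `D` could then be passed to any hierarchical clustering routine that accepts precomputed dissimilarities, which is the property the abstract highlights as useful for heatmap-style displays.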
