Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data

ABSTRACT Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored when the observed value is equal to or close to zero for one or more samples. This work is motivated by two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both applications, we use appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. To select the number of clusters, we use a non-asymptotic penalized criterion whose penalty is calibrated with the slope heuristics. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
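As a rough illustration of the general idea of pairing the Centered Log Ratio with K-means, the following is a minimal pure-Python sketch. It is not the coseq implementation (coseq is an R package) and it does not reproduce the Log Centered Log Ratio or the slope-heuristics selection of the number of clusters. The function names, the toy profiles, and the pseudocount of 1 added to every count to avoid taking the logarithm of zero are all assumptions made for this example.

```python
import math
import random

def clr(row, pseudocount=1.0):
    """Centered log ratio of one profile.

    The pseudocount is an assumption of this sketch (not taken from the
    paper); it keeps log() finite when a count is exactly zero.
    """
    shifted = [x + pseudocount for x in row]
    total = sum(shifted)
    logs = [math.log(x / total) for x in shifted]  # log of proportions
    mean_log = sum(logs) / len(logs)               # log of geometric mean
    return [v - mean_log for v in logs]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical counts over three conditions; the last profile has a zero.
profiles = [
    [90, 5, 5], [80, 10, 10], [85, 10, 5],   # mostly in condition 1
    [5, 5, 90], [10, 10, 80], [0, 10, 90],   # mostly in condition 3
]
transformed = [clr(p) for p in profiles]
labels, centers = kmeans(transformed, k=2)
print(labels)
```

The choice of pseudocount matters in this sketch: a very small value pushes profiles containing zeros far from all others in CLR space, which is one way to see why zero or near-zero values make the choice of transformation delicate.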
