Assessing Dissimilarity Measures for Sample-Based Hierarchical Clustering of RNA Sequencing Data Using Plasmode Datasets

Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technical artifacts such as differences in sequencing depth. This poses a challenge for finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to enable the use of Euclidean and correlation-based distances. Novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. The adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets, which may limit the scope of the conclusions. Here, we propose simulating realistic conditions by creating plasmode datasets to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that considered only data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using a Poisson-based dissimilarity, a rank correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resembled the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.
We report several satisfactory measures; the choice of a particular measure may depend on its availability in the preferred software pipeline.
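
As an illustration of one of the measures named above, the following is a minimal sketch of a rank correlation-based dissimilarity (one minus the Spearman correlation) followed by average-linkage hierarchical clustering of samples. It is not the authors' implementation: the function names, the toy read counts, and the choice of average linkage are assumptions made for this example. The toy samples within each group differ only by a constant factor, mimicking a difference in sequencing depth, which leaves ranks (and therefore the dissimilarity) unchanged.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_dissimilarity(counts):
    """Map each unordered sample pair to 1 - Spearman correlation."""
    ranked = {s: average_ranks(v) for s, v in counts.items()}
    return {frozenset((a, b)): 1.0 - pearson(ranked[a], ranked[b])
            for a, b in combinations(counts, 2)}

def cluster_distance(c1, c2, dist):
    """Average-linkage distance: mean pairwise dissimilarity between clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def average_linkage(samples, dist):
    """Agglomerative clustering; returns merges as (cluster, cluster, height)."""
    clusters = [(s,) for s in samples]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], dist))
        merges.append((c1, c2, cluster_distance(c1, c2, dist)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical read counts (6 genes per sample); A2 and B2 are deeper-sequenced
# copies of A1 and B1 (counts multiplied by 3 and 2, respectively).
counts = {
    "A1": [10, 200, 35, 0, 80, 5],
    "A2": [30, 600, 105, 0, 240, 15],
    "B1": [150, 12, 4, 60, 9, 300],
    "B2": [300, 24, 8, 120, 18, 600],
}

dist = spearman_dissimilarity(counts)
merges = average_linkage(list(counts), dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy case the two within-group pairs merge first and the two groups join last, which is the hierarchy one would expect by construction. Real count data would need the same check against a known structure, which is the role plasmode datasets play in the study.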
