Two-Tier Mapper, an unbiased topology-based clustering method for enhanced global gene expression analysis

Motivation: Unbiased clustering methods are needed to analyze growing numbers of complex datasets. Currently available clustering methods often depend on user-set parameters, lack stability, and are not applicable to small datasets. To overcome these shortcomings we used topological data analysis, an emerging field of mathematics with a wide range of applications that can discern additional features and uncover hidden insights in datasets.

Results: We have developed a topology-based clustering method called Two-Tier Mapper (TTMap) for enhanced analysis of global gene expression datasets. First, TTMap discerns divergent features in the control group, adjusts for them, and identifies outliers. Second, the deviation of each test sample from the control group in a high-dimensional space is computed, and the test samples are clustered using a new Mapper-based topological algorithm at two levels: a global tier and local tiers. All parameters are either carefully chosen or data-driven, avoiding any user-induced bias. The method is stable, different datasets can be combined for analysis, and significant subgroups can be identified. It outperforms current clustering methods in sensitivity and stability on synthetic and biological datasets, in particular when sample sizes are small; the outcome is not affected by removal of control samples, by choice of normalization, or by subselection of data. TTMap is readily applicable to complex, highly variable biological samples and holds promise for personalized medicine.

Availability: TTMap is supplied as an R package in Bioconductor.

Supplementary information: Supplementary data are available at Bioinformatics online.
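The second step described in the abstract (deviation from the control group, followed by Mapper-style clustering) can be illustrated with a toy sketch. This is not the TTMap implementation, which is an R package. The function names, the use of the plain control mean as the reference, the L1 norm as filter function, the interval cover, and the `eps` linkage threshold are all assumptions chosen for illustration. The control-group adjustment, outlier detection, and two-tier structure of TTMap are not reproduced here.

```python
from itertools import combinations

def deviations(controls, tests):
    """Per-gene deviation of each test sample from the control mean (assumed reference)."""
    n_genes = len(controls[0])
    mean = [sum(c[g] for c in controls) / len(controls) for g in range(n_genes)]
    return [[t[g] - mean[g] for g in range(n_genes)] for t in tests]

def l1(u, v):
    """L1 (Manhattan) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def components(indices, points, eps):
    """Connected components of the graph linking points at L1 distance <= eps."""
    remaining, clusters = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining if l1(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper(points, n_intervals=2, overlap=0.25, eps=1.0):
    """Toy Mapper: L1-norm filter, overlapping interval cover, clustering per interval."""
    zero = [0.0] * len(points[0])
    f = [l1(p, zero) for p in points]           # filter value: total deviation
    lo, hi = min(f), max(f)
    width = (hi - lo) / n_intervals or 1.0       # fall back to 1.0 if all values coincide
    nodes = []
    for k in range(n_intervals):
        a = lo + k * width - overlap * width
        b = lo + (k + 1) * width + overlap * width
        members = [i for i, v in enumerate(f) if a <= v <= b]
        nodes.extend(components(members, points, eps))
    # Two nodes are joined by an edge when they share at least one sample.
    edges = [(m, n) for m, n in combinations(range(len(nodes)), 2)
             if nodes[m] & nodes[n]]
    return nodes, edges

# Hypothetical expression values: two control samples, four test samples, two genes.
controls = [[1.0, 1.0], [3.0, 3.0]]
tests = [[2.0, 2.5], [2.5, 2.0], [6.0, 6.0], [6.5, 6.0]]
dev = deviations(controls, tests)
nodes, edges = mapper(dev)
```

In this made-up example the two test samples close to the control mean fall into one node and the two strongly deviating samples into another, with no edge between them. A real analysis would use the Bioconductor package rather than this sketch.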
