Random forest based similarity learning for single cell RNA sequencing data

Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around the discovery or characterization of types or sub-types of cells, so obtaining accurate cell-cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that applying generic methods, or methods developed for bulk RNA-seq data, is likely suboptimal.

Results: Here, we present RAFSIL, a random forest based approach to learn cell-cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, in which feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks such as dimension reduction, visualization and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach, yielding a useful tool that improves the analysis of scRNA-seq data.

Availability and implementation: The RAFSIL R package is available at www.kostkalab.net/software.html
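
As a rough illustration of what "similarity learning" with a random forest can mean, the sketch below computes the classical forest proximity: the fraction of trees in which two cells fall into the same leaf. This is a generic construction and not necessarily the procedure RAFSIL uses; the function name `forest_proximity` and the leaf assignments are hypothetical, and a real analysis would take the leaf assignments from a forest trained on the constructed features.

```python
from itertools import combinations


def forest_proximity(leaf_ids):
    """Return an n x n proximity matrix from per-tree leaf assignments.

    leaf_ids[i][t] is the identifier of the leaf that cell i reaches in
    tree t. The proximity of two cells is the fraction of trees in which
    they reach the same leaf (a generic random forest proximity, assumed
    here for illustration only).
    """
    n_cells = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    sim = [[0.0] * n_cells for _ in range(n_cells)]
    for i in range(n_cells):
        sim[i][i] = 1.0  # a cell always shares every leaf with itself
    for i, j in combinations(range(n_cells), 2):
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        sim[i][j] = sim[j][i] = shared / n_trees
    return sim


# Hypothetical leaf assignments for 4 cells across 3 trees (made-up values).
leaf_ids = [
    [0, 2, 1],
    [0, 2, 3],
    [4, 5, 3],
    [4, 5, 6],
]

similarity = forest_proximity(leaf_ids)
# A dissimilarity suitable for clustering or embedding can be derived as 1 - s.
dissimilarity = [[1.0 - s for s in row] for row in similarity]
```

A matrix of this kind can be passed to standard tools for hierarchical clustering or low-dimensional embedding, which is the role cell-cell similarities play in the exploratory tasks mentioned above.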
