scHinter: imputing dropout events for single-cell RNA-seq data with limited sample size

MOTIVATION Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful technique for studying dynamic gene regulation at unprecedented resolution. However, scRNA-seq data suffer from extremely high dropout rates and cell-to-cell variability, demanding new methods to recover lost gene expression. Although various dropout imputation approaches are available for scRNA-seq, most studies focus on data with a medium or large number of cells. Few have explicitly investigated how performance differs across sample sizes, or whether an approach is applicable to small or imbalanced data. New imputation approaches that generalize across sample sizes are therefore needed.

RESULTS We propose scHinter, a method for imputing dropout events in scRNA-seq data, with special emphasis on data with limited sample size. scHinter incorporates a voting-based ensemble distance and leverages the synthetic minority over-sampling technique (SMOTE) for random interpolation. A hierarchical framework is also embedded in scHinter to increase the reliability of imputation for small samples. We demonstrated the ability of scHinter to recover gene expression measurements across a wide spectrum of scRNA-seq datasets with varied sample sizes, and we examined the impact of sample size and cluster number on imputation. Across diverse scRNA-seq datasets with imbalanced or limited sample size, scHinter achieved higher and more robust performance than competing approaches, including MAGIC, scImpute, SAVER, and netSmooth.

AVAILABILITY Freely available for download at https://github.com/BMILAB/scHinter.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
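The random interpolation borrowed from SMOTE can be illustrated with a minimal sketch. SMOTE generates a synthetic point on the line segment between a sample and one of its neighbors, using a random weight drawn from [0, 1). The abstract does not specify how scHinter applies this step, so the code below is only an assumption for illustration: the function names `smote_interpolate` and `impute_zeros`, the rule of filling only zero entries, and the toy expression values are all hypothetical and are not taken from the scHinter implementation.

```python
import random


def smote_interpolate(cell, neighbor, rng):
    """Return a synthetic profile on the segment between two expression profiles.

    This is the generic SMOTE interpolation step: x + lam * (neighbor - x),
    with lam drawn uniformly from [0, 1).
    """
    lam = rng.random()
    return [c + lam * (n - c) for c, n in zip(cell, neighbor)]


def impute_zeros(cell, neighbor, rng):
    """Hypothetical usage: replace only zero entries of `cell` with
    interpolated values, leaving observed (non-zero) entries untouched."""
    synthetic = smote_interpolate(cell, neighbor, rng)
    return [s if c == 0 else c for c, s in zip(cell, synthetic)]


# Toy example: one cell with two zero entries and one similar neighbor cell.
rng = random.Random(0)
cell = [0.0, 5.0, 0.0, 2.0]
neighbor = [4.0, 6.0, 0.0, 1.0]
imputed = impute_zeros(cell, neighbor, rng)
print(imputed)
```

In this sketch, a zero that is also zero in the neighbor stays zero, while a zero with a non-zero neighbor value is filled with a value between 0 and the neighbor's expression. Choosing which neighbor to interpolate with is where a distance measure, such as the voting-based ensemble distance mentioned above, would come in; that part is not shown here.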
