Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Single-cell RNA-Sequencing (scRNA-Seq) is a fast-evolving technology that enables biological processes to be studied at unprecedented resolution. However, bioinformatics tools well suited to analyzing the data generated by this new technology are still lacking. Here we investigate the performance of the non-negative matrix factorization (NMF) method in analyzing a wide variety of scRNA-Seq datasets, ranging from mouse hematopoietic stem cells to human glioblastoma data. In comparison to other unsupervised clustering methods, including K-means and hierarchical clustering, NMF has higher accuracy in separating similar groups in various datasets. We ranked genes by their importance scores (D-scores) in separating these groups, and discovered that NMF uniquely identifies genes expressed at intermediate levels as top-ranked genes. Finally, we show that in conjunction with the modularity detection method FEM, NMF reveals meaningful protein-protein interaction modules. In summary, we propose that NMF is a desirable method to analyze heterogeneous single-cell RNA-Seq data. The NMF-based subpopulation detection package is available at: https://github.com/lanagarmire/NMFEM.
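To make the clustering idea concrete, the following is a minimal sketch of how NMF can assign cells to subpopulations. It is not the NMFEM package's implementation. It assumes a non-negative genes-by-cells expression matrix, uses the standard Lee-Seung multiplicative updates for the Frobenius loss (the abstract does not state which loss or solver the authors use), and labels each cell by its largest coefficient in H. The function names `nmf` and `cluster_cells` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative V (genes x cells) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius loss,
    an assumed choice for this sketch.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = V.shape
    # Random non-negative initialization of both factors.
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_cells(H):
    """Assign each cell (column of H) to the factor with the largest weight."""
    return H.argmax(axis=0)

# Toy data (fabricated for illustration only): 6 genes x 8 cells,
# where genes 0-2 are high in cells 0-3 and genes 3-5 are high in cells 4-7.
rng = np.random.default_rng(1)
V = rng.random((6, 8)) * 0.1
V[:3, :4] += 5.0
V[3:, 4:] += 5.0

W, H = nmf(V, k=2)
labels = cluster_cells(H)
print(labels)
```

In this sketch the columns of W play the role of metagenes and the columns of H give each cell's loading on them, so the same W could in principle be inspected to rank genes. The D-score ranking and the FEM module detection described above are not reproduced here.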
