High-Order Correlation Integration for Single-Cell or Bulk RNA-seq Data Analysis

Quantifying or labeling sample types with high quality is a challenging task and a key step toward understanding complex diseases. Reducing noise in the data and ensuring that the extracted intrinsic patterns agree with the primary data structure are both important for sample clustering and classification. Here we propose an effective data integration framework named HCI (High-order Correlation Integration), which combines a high-order correlation matrix with pattern fusion analysis (PFA) to extract features from high-dimensional data. On the one hand, the high-order Pearson correlation coefficient can highlight the latent patterns underlying noisy input datasets and thus improve the accuracy and robustness of currently available sample clustering algorithms. On the other hand, PFA can efficiently identify intrinsic sample patterns from different input matrices by optimally adjusting the signal effects. To validate the method, we first applied HCI to four single-cell RNA-seq (scRNA-seq) datasets to distinguish cell types, and found that HCI identifies the previously known cell types of single-cell samples with higher accuracy and robustness than other methods under different conditions. Second, we integrated heterogeneous omics data from TCGA and GEO datasets, including bulk RNA-seq data, and HCI outperformed the other methods at identifying distinct cancer subtypes. In an additional case study, we constructed an mRNA-miRNA regulatory network of colorectal cancer based on the feature weights estimated by HCI, in which the differentially expressed mRNAs and miRNAs were significantly enriched in well-known functional sets of colorectal cancer, such as KEGG pathways and IPA disease annotations.
All these results indicate that HCI is flexible and applicable to sample clustering across different types and organizations of RNA-seq data.
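
As a rough illustration of the high-order correlation idea, the sketch below reads "high-order" as repeatedly taking the Pearson correlation of the rows of the previous sample-by-sample correlation matrix. That reading is an assumption made here for illustration; the abstract does not specify the exact construction, and the PFA step is not shown. The function names (`pearson`, `correlation_matrix`, `high_order_correlation`) and the toy data are hypothetical.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den if den > 0 else 0.0


def correlation_matrix(profiles):
    """Pairwise Pearson correlations between the rows of `profiles`."""
    n = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]


def high_order_correlation(profiles, order=2):
    """Assumed reading of 'high-order': correlate the rows of the previous
    sample-by-sample correlation matrix, repeated (order - 1) times."""
    if order < 1:
        raise ValueError("order must be >= 1")
    c = correlation_matrix(profiles)
    for _ in range(order - 1):
        c = correlation_matrix(c)
    return c


# Toy data (synthetic, not from the paper): 10 samples over 200 genes,
# two groups of 5 that share a group-specific expression profile plus noise.
rng = random.Random(0)
n_genes, group_size = 200, 5
base_a = [rng.gauss(0, 1) for _ in range(n_genes)]
base_b = [rng.gauss(0, 1) for _ in range(n_genes)]
samples = [[g + rng.gauss(0, 1) for g in base]
           for base in (base_a, base_b) for _ in range(group_size)]

first = high_order_correlation(samples, order=1)
second = high_order_correlation(samples, order=2)


def mean_within_between(c):
    """Mean correlation for same-group and different-group sample pairs."""
    within, between = [], []
    for i in range(len(c)):
        for j in range(i + 1, len(c)):
            same = (i < group_size) == (j < group_size)
            (within if same else between).append(c[i][j])
    return sum(within) / len(within), sum(between) / len(between)


print("first order  (within, between):", mean_within_between(first))
print("second order (within, between):", mean_within_between(second))
```

Comparing the within-group and between-group means at each order gives a quick check of whether the higher-order matrix separates the two synthetic groups more clearly than the first-order one.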
