Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm

Background: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful methods that can account for complex data and detect the most informative signatures are still lacking.

Methods: In this article, we present a new framework for identifying gene signatures from RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the limma package to identify differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, "Multi-objective optimization for collecting cluster alternatives" (the MOCCA R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data with that cluster size. The best cluster was selected by computing the average Spearman's correlation score over all pairs of genes in each cluster. The best cluster is treated as the signature for the respective disease or cellular condition.

Results: We applied our framework to a cervical cancer RNA-seq dataset that included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified 582 DEGs, of which 260 were up-regulated and 322 were down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contained 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Evaluation by sensitivity, specificity, precision, and accuracy indicated that the signature classifies the samples reliably.

Conclusions: Our framework identified a cluster, derived through multi-objective optimization, that serves as a signature and classifies the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. The method is applicable to signature discovery in RNA-seq or microarray data.
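The cluster-selection step described in the abstract (scoring each cluster by the average pairwise Spearman correlation of its genes and keeping the best one) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which relies on R packages. The gene names, expression values, and cluster labels are invented toy data, the helper names (`average_ranks`, `mean_pairwise_spearman`, `best_cluster`) are hypothetical, and treating the highest signed average as "best" is an assumption; the abstract does not say whether absolute correlations are used.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """Average Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # a single-gene cluster has no pairs to score
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def best_cluster(expression, labels):
    """Score every cluster and return (best label, all scores).

    Assumption: the cluster with the highest signed average wins.
    """
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(expression[gene])
    scores = {lab: mean_pairwise_spearman(p) for lab, p in clusters.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): five genes measured across five samples.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 5, 8, 9],
    "g3": [1.5, 2.5, 3.0, 6.0, 7.0],
    "g4": [5, 1, 4, 2, 3],
    "g5": [1, 5, 2, 4, 3],
}
# Cluster labels as they might come out of k-means (hypothetical).
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}

best, scores = best_cluster(expression, labels)
print(best, scores)
```

In this toy case the three genes in cluster 0 rise together across samples, so that cluster scores highest and would be taken as the signature. In practice the labels would come from k-means run with the cluster size chosen by the multi-objective step.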
