CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules

Abstract Motivation: Nowadays, knowledge extraction methods from Next Generation Sequencing data are highly requested. In this work, we focus on RNA-seq gene expression analysis and specifically on case–control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorithms compute a single classification model that contains few features (genes). On the contrary, our goal is to elicit a higher amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. Results: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and performs again the classification procedure until a stopping criterion is verified. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. Availability and implementation: dmb.iasi.cnr.it/camur.php Contact: emanuel@iasi.cnr.it Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Weimin Xiao,et al.  Evolving accurate and compact classification rules with gene expression programming , 2003, IEEE Trans. Evol. Comput..

[2]  Gordon K Smyth,et al.  Statistical Applications in Genetics and Molecular Biology Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2011 .

[3]  Jorma Rissanen,et al.  SLIQ: A Fast Scalable Classifier for Data Mining , 1996, EDBT.

[4]  Toshihide Ibaraki,et al.  Logical Analysis of Data , 2005 .

[5]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[6]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[7]  Kaisa Miettinen,et al.  Nonlinear multiobjective optimization , 1998, International series in operations research and management science.

[8]  Sanghyun Park,et al.  Integrative Gene Network Construction to Analyze Cancer Recurrence Using Semi-Supervised Learning , 2014, PloS one.

[9]  Aik Choon Tan,et al.  Ensemble machine learning on gene expression data for cancer classification. , 2003, Applied bioinformatics.

[10]  Belur V. Dasarathy,et al.  Nearest neighbor (NN) norms: NN pattern classification techniques , 1991 .

[11]  Daniel Q. Naiman,et al.  Simple decision rules for classifying human cancers from gene expression profiles , 2005, Bioinform..

[12]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[13]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[14]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[15]  Klaus Truemper,et al.  A MINSAT Approach for Learning in Logic Domains , 2002, INFORMS J. Comput..

[16]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[17]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[18]  S. Muthukrishnan,et al.  AGFS: Adaptive Genetic Fuzzy System for medical data classification , 2014, Appl. Soft Comput..

[19]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[20]  Simon Haykin,et al.  Neural Networks and Learning Machines , 2010 .

[21]  Eleanor Howe,et al.  RNA-Seq analysis in MeV , 2011, Bioinform..

[22]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[23]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[24]  K. Deb,et al.  Reliable classification of two-class cancer data using evolutionary algorithms. , 2003, Bio Systems.

[25]  G. von Heijne,et al.  Tissue-based map of the human proteome , 2015, Science.

[26]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..

[27]  Brian R. Gaines,et al.  Induction of ripple-down rules applied to modeling large databases , 1995, Journal of Intelligent Information Systems.

[28]  Martin A. Riedmiller,et al.  Advanced supervised learning in multi-layer perceptrons — From backpropagation to adaptive learning algorithms , 1994 .

[29]  W. Ramakrishna,et al.  Machine Learning Approaches Distinguish Multiple Stress Conditions using Stress-Responsive Genes and Identify Candidate Genes for Broad Resistance in Rice[C][W][OPEN] , 2013, Plant Physiology.

[30]  F. Nogueira,et al.  RNA Expression Profiles and Data Mining of Sugarcane Response to Low Temperature1 , 2003, Plant Physiology.

[31]  Daniel Q. Naiman,et al.  Classifying Gene Expression Profiles from Pairwise mRNA Comparisons , 2004, Statistical applications in genetics and molecular biology.

[32]  George Eastman House,et al.  Sparse Bayesian Learning and the Relevan e Ve tor Ma hine , 2001 .

[33]  R. Tothill,et al.  Development and validation of a gene expression tumour classifier for cancer of unknown primary , 2015, Pathology.

[34]  Ian H. Witten,et al.  Generating Accurate Rule Sets Without Global Optimization , 1998, ICML.

[35]  Giovanni Felici,et al.  Supervised DNA Barcodes species classification: analysis, comparisons and results , 2014, BioData Mining.

[36]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[37]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[38]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[39]  Michael B. Miller Linear Regression Analysis , 2013 .

[40]  Arthur Liberzon,et al.  Using GenePattern for Gene Expression Analysis , 2008, Current protocols in bioinformatics.

[41]  Jan Komorowski,et al.  Learning Rule-based Models of Biological Process from Gene Expression Time Profiles Using Gene Ontology , 2003, Bioinform..

[42]  Giovanni Felici,et al.  GELA: A Software Tool for the Analysis of Gene Expression Data , 2015, 2015 26th International Workshop on Database and Expert Systems Applications (DEXA).

[43]  Richard A. Moore,et al.  Recurrent DGCR8, DROSHA, and SIX homeodomain mutations in favorable histology Wilms tumors. , 2015, Cancer cell.

[44]  Daniel T. Larose,et al.  Discovering Knowledge in Data: An Introduction to Data Mining , 2005 .

[45]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[46]  Allen R. Tannenbaum,et al.  Recursive feature elimination for brain tumor classification using desorption electrospray ionization mass spectrometry imaging , 2012, 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[47]  Jack Y. Yang,et al.  A comparative study of different machine learning methods on microarray gene expression data , 2008, BMC Genomics.

[48]  Jing Yuan,et al.  Rule based classifier for the analysis of gene-gene and gene-environment interactions in genetic association studies , 2009, BioData Mining.

[49]  Anna Tramontano,et al.  FIDEA: a server for the functional interpretation of differential expression analysis , 2013, Nucleic Acids Res..