CamurWeb: a classification software and a large knowledge base for gene expression data of cancer

Background: The rapid growth of Next Generation Sequencing data demands new knowledge extraction methods. In particular, RNA sequencing gene expression experiments lend themselves to case-control studies on cancer, which can be addressed with supervised machine learning techniques that extract human-interpretable models composed of genes and their relation to the investigated disease. State-of-the-art rule-based classifiers are designed to extract a single classification model, possibly composed of a few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could potentially be involved in the analyzed tumor. Such a comprehensive, open-access knowledge base is needed to disseminate novel insights about cancer.

Results: We propose CamurWeb, a new method and web-based software that extracts multiple equivalent classification models in the form of logic formulas ("if then" rules) and builds a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that together enable the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user-friendly interface for running the software, querying the results, and managing the performed experiments. Users can create a profile, upload their gene expression data, run the classification analyses, and interpret the results with predefined queries. To validate the software, we apply it to all publicly available RNA sequencing datasets from The Cancer Genome Atlas database, obtaining a large open-access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb.

Conclusions: The experiments confirm the validity of CamurWeb, yielding many classification models and thus several genes that are associated with 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and for designing biological experiments in cancer research.
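The iterative classification procedure with feature elimination mentioned in the Results can be illustrated with a minimal Python sketch. This is not the CamurWeb implementation: the abstract does not specify the underlying classifier or the elimination strategy, so the sketch assumes a toy single-gene threshold learner that only tests rules of the form "IF gene > t THEN tumor". The function names (`best_stump`, `extract_rules`), the gene names, the expression values, and the `min_accuracy` stopping threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Sample = Dict[str, float]          # gene name -> expression value
Rule = Tuple[str, float, float]    # (gene, threshold, accuracy)


def best_stump(samples: List[Sample], labels: List[bool],
               genes: List[str]) -> Optional[Rule]:
    """Return the most accurate rule 'IF gene > t THEN tumor' (toy learner)."""
    best: Optional[Rule] = None
    for gene in genes:
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            correct = sum((s[gene] > t) == y for s, y in zip(samples, labels))
            acc = correct / len(samples)
            if best is None or acc > best[2]:
                best = (gene, t, acc)
    return best


def extract_rules(samples: List[Sample], labels: List[bool],
                  min_accuracy: float = 0.9,
                  max_iterations: int = 10) -> List[Rule]:
    """Learn a rule, remove its gene, and repeat until accuracy drops."""
    remaining = sorted(samples[0].keys())
    rules: List[Rule] = []
    for _ in range(max_iterations):
        if not remaining:
            break
        rule = best_stump(samples, labels, remaining)
        if rule is None or rule[2] < min_accuracy:
            break  # no sufficiently accurate alternative model is left
        rules.append(rule)
        remaining.remove(rule[0])  # feature elimination step
    return rules


# Hypothetical toy data: True = tumor sample, False = normal sample.
samples = [
    {"GENE_A": 9.0, "GENE_B": 8.0, "GENE_C": 1.0},
    {"GENE_A": 8.5, "GENE_B": 7.5, "GENE_C": 5.0},
    {"GENE_A": 2.0, "GENE_B": 3.0, "GENE_C": 4.0},
    {"GENE_A": 1.5, "GENE_B": 2.5, "GENE_C": 2.0},
]
labels = [True, True, False, False]

rules = extract_rules(samples, labels)
for gene, threshold, accuracy in rules:
    print(f"IF {gene} > {threshold} THEN tumor (accuracy {accuracy:.2f})")
```

On this toy data the loop returns two alternative rules, one per discriminating gene, and stops when the remaining gene can no longer reach the accuracy threshold. Collecting such alternative rules across iterations is the idea behind building a knowledge base of many models rather than keeping a single classifier.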
