IRIS-TCGA: An Information Retrieval and Integration System for Genomic Data of Cancer

Data integration is one of the most challenging research topics in many knowledge domains, and biology is surely one of them. However, the underlying theory and the state-of-the-art methods make this task too complex for most small research centers. Fortunately, several organizations are focusing on collecting heterogeneous data, which makes it easier to design analysis tools and to test biological and medical hypotheses on integrated data. One of the most evident cases of such efforts is The Cancer Genome Atlas (TCGA), a database that contains a large variety of information related to different types of cancer. This database offers a great opportunity to those interested in analyzing integrated data; however, exploiting it is not easy, since nontrivial effort is required to extract and combine the data before they can be analyzed from an integrated perspective. In this paper we present IRIS-TCGA, an online web service developed to perform multiple queries for data integration on TCGA. Unlike other tools that have been proposed to interact with TCGA, IRIS-TCGA allows direct access to the data and enables the extraction of detailed combinations of subsets of the repository, according to filters and high-order queries. The structure of the system is simple, as it is built on two main operators, union and intersection, which are then used to construct queries of higher complexity. The first version of the system supports the extraction and integration of gene expression (RNA-sequencing, microarrays), DNA-methylation, and DNA-sequencing (mutations) data from experiments on tissues of patients, together with their related metadata, in a gene-oriented organization. The extracted data matrices are particularly suited for data mining applications (e.g., classification). Finally, we show two application examples in which IRIS-TCGA is used to integrate genomic data from RNA-sequencing and DNA-methylation experiments, and in which state-of-the-art bioinformatics analysis tools are applied to the integrated data in order to extract new knowledge from them. IRIS-TCGA is freely available at http://bioinf.iasi.cnr.it/iristcga/.
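
To make the role of the union and intersection operators more concrete, the following is a minimal sketch in Python of how such operators could be composed into a higher-order query and used to build a gene-oriented matrix. It is not the IRIS-TCGA implementation: the patient identifiers, gene values, and the helper names `union`, `intersection`, and `integrate` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical toy data: experiment type -> set of patient identifiers.
# These identifiers are invented and are not real TCGA barcodes.
samples_by_experiment: Dict[str, Set[str]] = {
    "rnaseq": {"P1", "P2", "P3"},
    "methylation": {"P2", "P3", "P4"},
    "mutations": {"P3", "P5"},
}

# Hypothetical per-patient, per-gene measurements for two experiment types.
rnaseq_values = {
    "P1": {"TP53": 8.1, "BRCA1": 2.7},
    "P2": {"TP53": 10.5, "BRCA1": 3.2},
    "P3": {"TP53": 7.9, "BRCA1": 4.0},
}
methylation_values = {
    "P2": {"TP53": 0.12, "BRCA1": 0.80},
    "P3": {"TP53": 0.45, "BRCA1": 0.33},
    "P4": {"TP53": 0.50, "BRCA1": 0.61},
}


def union(*groups: Set[str]) -> Set[str]:
    """Patients present in at least one of the groups."""
    return set().union(*groups)


def intersection(*groups: Set[str]) -> Set[str]:
    """Patients present in every group (empty if no group is given)."""
    if not groups:
        return set()
    return set.intersection(*groups)


def integrate(patients: Set[str], **tables: Dict[str, Dict[str, float]]):
    """Build a gene-oriented matrix: one row per patient, one column per
    (gene, experiment) pair, restricted to the selected patients."""
    matrix = {}
    for patient in sorted(patients):
        row = {}
        for experiment, table in tables.items():
            for gene, value in table[patient].items():
                row[f"{gene}_{experiment}"] = value
        matrix[patient] = row
    return matrix


# Higher-order query: patients with RNA-seq data AND
# (DNA-methylation OR mutation) data.
query = intersection(
    samples_by_experiment["rnaseq"],
    union(samples_by_experiment["methylation"],
          samples_by_experiment["mutations"]),
)

# Integrate RNA-seq and methylation only for patients that have both.
both = intersection(samples_by_experiment["rnaseq"],
                    samples_by_experiment["methylation"])
matrix = integrate(both, rnaseq=rnaseq_values, methylation=methylation_values)
```

In this sketch each row of `matrix` describes one patient and each column one gene measured in one experiment type, which is the kind of layout a classifier could consume directly once a class label is attached to each row.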
