Dintor: functional annotation of genomic and proteomic data

Background: During the last decade, a great number of extremely valuable large-scale genomics and proteomics datasets have become available to the research community. In addition, dropping costs for conducting high-throughput sequencing experiments and the option to outsource them considerably contribute to an increasing number of researchers becoming active in this field. Even though various computational approaches have been developed to analyze these data, it is still a laborious task involving prudent integration of many heterogeneous and frequently updated data sources, creating a barrier for interested scientists to accomplish their own analysis.

Results: We have implemented Dintor, a data integration framework that provides a set of over 30 tools to assist researchers in the exploration of genomics and proteomics datasets. Each of the tools solves a particular task and several tools can be combined into data processing pipelines. Dintor covers a wide range of frequently required functionalities, from gene identifier conversions and orthology mappings to functional annotation of proteins and genetic variants up to candidate gene prioritization and Gene Ontology-based gene set enrichment analysis. Since the tools operate on constantly changing datasets, we provide a mechanism to unambiguously link tools with different versions of archived datasets, which guarantees reproducible results for future tool invocations. We demonstrate a selection of Dintor’s capabilities by analyzing datasets from four representative publications. The open source software can be downloaded and installed on a local Unix machine. For reasons of data privacy it can be configured to retrieve local data only. In addition, the Dintor tools are available on our public Galaxy web service at http://dintor.eurac.edu.

Conclusions: Dintor is a computational annotation framework for the analysis of genomic and proteomic datasets, providing a rich set of tools that cover the most frequently encountered tasks. A major advantage is its capability to consistently handle multiple versions of tool-associated datasets, supporting the researcher in delivering reproducible results.
