High-throughput translational medicine: challenges and solutions.

Recent technological advances in genomics allow the production of biological data at unprecedented tera- and petabyte scales. Yet extracting useful knowledge from this voluminous data presents a significant challenge to the scientific community. Efficient mining of vast and complex data sets for biomedical research critically depends on the seamless integration of clinical, genomic, and experimental information with prior knowledge about genotype-phenotype relationships accumulated in a plethora of publicly available databases. Furthermore, such experimental data should be accessible to a variety of algorithms and analytical pipelines that drive computational analysis and data mining. Translational projects require sophisticated approaches that coordinate and perform the various analytical steps involved in extracting useful knowledge from accumulated clinical and experimental data in an orderly, semiautomated manner. This presents a number of challenges, such as (1) high-throughput data management involving data transfer, data storage, and access control; (2) scalable computational infrastructure; and (3) analysis of large-scale multidimensional data for the extraction of actionable knowledge. We present a scalable computational platform based on crosscutting requirements from multiple scientific groups for data integration, management, and analysis. The goal of this integrated platform is to address these challenges and to support the end-to-end analytical needs of various translational projects.
