TargetMine, an Integrated Data Warehouse for Candidate Gene Prioritisation and Target Discovery

Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and in biomedical research generally. An integrated approach that combines results from multiple data types is best suited to target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine is built on the InterMine framework and extends it with new data models, such as protein-DNA interactions, integrated in a novel way. It enables complex searches that are difficult to perform with existing tools, and it supports the integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.
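
To illustrate the kind of measures the benchmark reports, the following is a minimal sketch of how precision and coverage could be computed for a prioritised gene list. It assumes that coverage means the fraction of known disease-associated genes that the protocol recovers (that is, recall); the abstract does not define the term. The function name and the gene symbols are hypothetical and are not taken from TargetMine.

```python
def precision_and_coverage(predicted, known):
    """Return (precision, coverage) for a set of prioritised genes.

    precision = correctly prioritised genes / all prioritised genes
    coverage  = correctly prioritised genes / all known disease genes
    (coverage is treated as recall here, which is an assumption).
    """
    predicted = set(predicted)
    known = set(known)
    hits = len(predicted & known)
    precision = hits / len(predicted) if predicted else 0.0
    coverage = hits / len(known) if known else 0.0
    return precision, coverage


# Hypothetical example: four prioritised genes, three known disease genes.
prioritised = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
known_disease_genes = {"GENE_A", "GENE_B", "GENE_E"}
precision, coverage = precision_and_coverage(prioritised, known_disease_genes)
print(f"precision={precision:.2f}, coverage={coverage:.2f}")
```

Using sets keeps the comparison independent of list order and of duplicate entries, which suits a benchmark that only asks whether a known gene was recovered.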
