Genomic and Proteomic Databases: Foundations, Current Status and Future Applications

This paper presents an extensive survey of the databases and other resources relevant to current research in bioinformatics, and of the issues that confront database researchers in supporting biologists. We begin with an overview of the concepts and principles needed to understand the data captured in these databases. We briefly trace the evolution of biological advances and point out the importance of capturing data about genes, the fundamental building blocks that encode the characteristics of life, and proteins, the essential ingredients for sustaining life. The study of genes and of proteins has become extremely important and is known as genomics and proteomics, respectively. Although numerous databases exist for the various subfields of biology, we focus on genomic and proteomic databases, which are crucial stepping stones for other fields and are expected to play an important role in future applications of biology and medicine. We present a detailed listing of these databases with information about their sizes, formats and current status. Related databases, such as molecular pathway and interconnection network databases, are mentioned, but their full coverage is beyond the scope of a single paper. We comment on the peculiar nature of biological data, which presents special problems in organizing and accessing these databases. We also discuss the capabilities needed for database development and information management in bioinformatics, with particular attention to ontology development. Two case studies based on our own research are summarized: the development of a new genome database called Mitomap, and the creation of a framework for discovering relationships among genes from the biomedical literature. The paper concludes with an overview of the applications that these databases will drive in medicine and healthcare. A glossary of important terms is provided at the end of the paper.
