Extending traditional query-based integration approaches for functional characterization of post-genomic data

MOTIVATION: Identifying and characterizing regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and of its format (traditional database, flat file, web site, or results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, so non-traditional approaches are needed to exclude extraneous data.

RESULTS: Target Informatics Net (TINet) is a readily extensible data integration system developed at GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM) multidatabase middleware system of Gene Logic Inc. Data sources currently integrated include the Mouse Genome Database (MGD) and Gene Expression Database (GXD), GenBank, SwissProt, PubMed, GeneCards, the results of runtime BLAST and PROSITE searches, and GSK proprietary relational databases. Special-purpose class methods used to filter and augment query results include regular expression pattern-matching over BLAST HSP alignments and retrieval of partial sequences derived from primary structure annotations. All data sources and methods are accessible through an SQL-like query language or a GUI, so that when new investigations arise, no programming beyond query specification is required. The power and flexibility of this approach are illustrated by integrated queries such as: (1) 'find homologs in genomic sequence to all novel genes cloned and reported in the scientific literature within the past three months that are linked to the MeSH term "neoplasms"'; (2) 'using a neuropeptide precursor query sequence, return only HSPs where the target genomic sequences conserve the G[KR][KR] motif at the appropriate points in the HSP alignment'; and (3) 'of the human genomic sequences annotated with exon boundaries in GenBank, return only those with valid putative donor/acceptor sites and start/stop codons'.
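
The motif filter in example query (2) can be illustrated with a short Python sketch. It is not TINet or OPM code: the function names, the toy alignments, and the rule that an HSP is kept only when every G[KR][KR] site in the query is matched in the target are all assumptions made for illustration. The sketch takes the two aligned strings of an HSP (gaps written as '-'), finds each motif occurrence in the ungapped query, maps it back to alignment columns, and tests whether the target residues in those columns also match the motif.

```python
import re

# Motif taken from example query (2); every other name here is hypothetical.
MOTIF = re.compile(r"G[KR][KR]")

def motif_sites(query_aln, target_aln, motif=MOTIF):
    """Return (start_column, conserved) for each motif occurrence in the query.

    query_aln and target_aln are the aligned strings of one HSP, of equal
    length, with '-' marking gaps. Occurrences are found non-overlapping.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    # Alignment column of every non-gap query residue, in order.
    columns = [i for i, residue in enumerate(query_aln) if residue != "-"]
    ungapped_query = query_aln.replace("-", "")
    sites = []
    for match in motif.finditer(ungapped_query):
        site_columns = columns[match.start():match.end()]
        # Target residues aligned to the query's motif residues.
        target_residues = "".join(target_aln[c] for c in site_columns)
        conserved = motif.fullmatch(target_residues) is not None
        sites.append((site_columns[0], conserved))
    return sites

def keep_hsp(query_aln, target_aln, motif=MOTIF):
    """Assumed rule: keep an HSP only if the query has at least one motif
    site and the target conserves the motif at every such site."""
    sites = motif_sites(query_aln, target_aln, motif)
    return bool(sites) and all(conserved for _, conserved in sites)

# Toy HSPs as (query, target) aligned strings; invented for illustration.
hsps = [
    ("MAGKRSL-PGRKA", "MSGRRSLQPGKKA"),  # both sites conserved
    ("MAGKRSL-PGRKA", "MSGRRSLQPGAKA"),  # second site lost in the target
    ("MAGKRSLPA", "MAG-RSLPA"),          # target gap inside the site
]
kept = [hsp for hsp in hsps if keep_hsp(*hsp)]
```

With these toy alignments only the first HSP is kept. In a system like the one described, such a check would run as a filtering method inside the query rather than as a post-processing script, so that non-conserving HSPs never reach the result set.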
