The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining

MOTIVATION: To make effective use of the vast amounts of expressed sequence tag (EST) sequence data generated by the Merck-sponsored EST project and other similar efforts, sequences must be organized into gene classes, and scientists must be able to 'mine' the gene class data in the context of related genomic data.

RESULTS: This paper presents the Merck Gene Index browser, an easily extensible, World Wide Web-based system for mining the Merck Gene Index (MGI) and related genomic data. The MGI is a non-redundant set of clones and sequences, each representing a distinct gene, constructed from all high-quality 3' EST sequences generated by the Merck-sponsored EST project. The MGI browser integrates data from a variety of sources and storage formats, both local and remote, using an eclectic integration strategy that includes a federation of relational databases, a local data warehouse and simple hypertext links. Data currently integrated include LENS cDNA clone and EST data, dbEST protein and non-EST nucleic acid similarity data, WashU sequence chromatograms, Entrez sequence and Medline entries, and UniGene gene clusters. Flatfile sequence data are accessed using the Bioapps server, an internally developed client-server system that supports generic sequence analysis applications. Browser data are retrieved and formatted by means of the Bioinformatics Data Integration Toolkit (B-DIT), a new suite of Perl routines.
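The three integration strategies named above (a federation of relational databases, a local data warehouse and simple hypertext links) can be illustrated with a minimal sketch. This is not the B-DIT code, which the abstract describes as Perl routines; it is a Python illustration under stated assumptions. Every table name, identifier, data value and the link URL below is hypothetical, and two attached in-memory SQLite databases merely stand in for separate relational sources.

```python
import sqlite3
from urllib.parse import urlencode

# Hypothetical link target; a real browser page would point at the remote resource.
LINK_BASE = "https://example.org/sequence"

# Hypothetical local warehouse: precomputed similarity hits keyed by EST id.
WAREHOUSE = {"EST1": ["hypothetical_protein_hit_1"], "EST2": []}


def build_federation():
    """Attach two independent databases to one connection (toy federation)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS clones")
    conn.execute("ATTACH DATABASE ':memory:' AS clusters")
    conn.execute("CREATE TABLE clones.est (est_id TEXT, clone_id TEXT)")
    conn.execute("CREATE TABLE clusters.member (est_id TEXT, cluster_id TEXT)")
    conn.executemany(
        "INSERT INTO clones.est VALUES (?, ?)",
        [("EST1", "CLONE_A"), ("EST2", "CLONE_B")],
    )
    conn.executemany(
        "INSERT INTO clusters.member VALUES (?, ?)",
        [("EST1", "CL_1"), ("EST2", "CL_1")],
    )
    return conn


def gene_report(conn, est_id):
    """Combine a federated join, a warehouse lookup and a hypertext link."""
    row = conn.execute(
        "SELECT e.clone_id, m.cluster_id "
        "FROM clones.est AS e JOIN clusters.member AS m ON e.est_id = m.est_id "
        "WHERE e.est_id = ?",
        (est_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "est_id": est_id,
        "clone_id": row[0],                             # from the federated query
        "cluster_id": row[1],                           # from the federated query
        "similarity_hits": WAREHOUSE.get(est_id, []),   # from the local warehouse
        "link": LINK_BASE + "?" + urlencode({"acc": est_id}),  # plain hyperlink
    }
```

Calling `gene_report(build_federation(), "EST1")` returns one record that draws on all three mechanisms. The point of the sketch is the division of labour: live joins across sources where data change often, a local copy where remote lookups would be slow, and a bare link where the remote site already presents the data adequately.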
