BioSeek: exploiting source-capability information for integrated access to multiple bioinformatics data sources

Modern bioinformatics data sources are widely used by molecular biologists for homology searching and new drug discovery. User-friendly yet responsive access is one of the most desirable properties of any system that integrates the rapidly growing, heterogeneous, and distributed collection of data sources. The increasing volume and diversity of digital information related to bioinformatics (such as genomes, protein sequences, and protein structures) have led to a growing problem that conventional data management systems do not face: finding which information sources, out of many candidates, are the most relevant and most accessible for answering a given user query. We refer to this as the query routing problem. In this paper we introduce the notation and issues of query routing, and present a practical solution for designing a scalable query routing system based on multi-level progressive pruning strategies. The key idea is to create and maintain source capability profiles independently, and to provide algorithms that dynamically discover relevant information sources for a given query by using those profiles. Compared with the keyword-based indexing techniques adopted by most search engines and software, our approach offers fine-grained interest matching, and is therefore more powerful and effective for handling queries with complex conditions.
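To make the idea of profile-based routing concrete, the following is a minimal sketch of two-level progressive pruning, not the BioSeek implementation. The profile fields (`data_types`, `searchable`), the `Query` shape, the `route` function, and the source names are all hypothetical and chosen only for illustration. The first level discards sources that do not hold the requested kind of data; the second keeps only sources whose profile says they can evaluate every condition in the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical capability profile, maintained separately from the source."""
    name: str
    data_types: frozenset  # kinds of data the source holds
    searchable: frozenset  # attributes the source accepts conditions on


@dataclass
class Query:
    """Hypothetical query: one data type plus attribute conditions."""
    data_type: str
    conditions: dict  # attribute name -> required value


def route(query, profiles):
    """Return names of sources that survive both pruning levels."""
    # Level 1 (coarse): keep sources that hold the requested data type.
    candidates = [p for p in profiles if query.data_type in p.data_types]
    # Level 2 (fine): keep sources able to evaluate every query condition.
    needed = set(query.conditions)
    return [p.name for p in candidates if needed <= p.searchable]


profiles = [
    SourceProfile("source_a", frozenset({"protein_sequence"}),
                  frozenset({"organism", "length"})),
    SourceProfile("source_b", frozenset({"protein_sequence"}),
                  frozenset({"organism"})),
    SourceProfile("source_c", frozenset({"protein_structure"}),
                  frozenset({"organism", "length"})),
]
query = Query("protein_sequence", {"organism": "human", "length": 300})
print(route(query, profiles))  # ['source_a']
```

In this sketch `source_c` is pruned at the first level because it holds the wrong kind of data, and `source_b` at the second because it cannot filter on `length`. A keyword index alone would not distinguish `source_a` from `source_b`, which is the kind of fine-grained matching the abstract contrasts with keyword-based indexing.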
