BioExtract Server—An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data

Many in silico investigations in bioinformatics require access to multiple, distributed data sources and analytic tools. The requisite data sources may include large public data repositories, community databases, and project databases for use in domain-specific research. Different data sources frequently use distinct query languages and return results in their own formats, so researchers must either rely on a small number of primary data sources or become familiar with multiple query languages and formats. Similarly, the associated analytic tools often require specific input formats and produce their own output formats, which makes it difficult to use the output from one tool as input to another. The BioExtract Server (http://bioextract.org) is a Web-based data integration application designed to consolidate, analyze, and serve data from heterogeneous biomolecular databases in the form of a mash-up. The basic operations of the BioExtract Server allow researchers, via their Web browsers, to specify data sources, flexibly query data sources, apply analytic tools, download result sets, and store query results for later reuse. As a researcher works with the system, their "steps" are saved in the background. At any time, these steps can be preserved long-term as a workflow simply by providing a workflow name and description.
