Data Access and Integration in the ISPIDER Proteomics Grid

Grid computing has great potential for supporting the integration of complex, fast-changing biological data repositories to enable distributed data analysis. Proteomics is one such scenario: its resources are being developed rapidly with the emergence of affordable, reliable methods for studying the proteome. The protein identifications arising from these methods derive from multiple repositories, which need to be integrated to enable uniform access to them. A number of technologies exist that enable these resources to be accessed in a Grid environment, but because the resources were developed independently, significant data integration challenges, such as heterogeneity and schema evolution, have to be met. This paper presents an architecture that combines Grid data access (OGSA-DAI), Grid distributed querying (OGSA-DQP) and data integration (AutoMed) software tools to support distributed data analysis. We discuss the application of this architecture to the integration of several autonomous proteomics data resources.
