The Encyclopedia of Life project: Grid software and deployment

The ongoing global effort of genome sequencing is making large-scale comparative proteomic analysis an intriguing task. The Encyclopedia of Life (EOL; http://eol.sdsc.edu) project aims to provide current functional and structural annotations for all available proteomes, a computational challenge without precedent in biology. Using an integrative genome annotation pipeline (iGAP), we have produced 3D models and functional annotations for more than 100 proteomes thus far. This process is greatly facilitated by grid compute resources, and especially by the development of grid application execution environments. The AppLeS (Application-Level Scheduling) Parameter Sweep Template (APST) has been adopted by the EOL project as a mediator to grid middleware. APST has made the annotation process much more efficient, highly automated, and scalable. We are currently building a domain-specific bioinformatics workflow management system (BWMS) on top of APST, which further streamlines grid deployment of life science applications. With these developments in mind, we discuss some common problems and expectations of grid computing for high-throughput proteomics.
