A Higher-Order Component for Efficient Genome Processing in the Grid

Computational grids combine computers on the Internet for distributed data processing and are an attractive platform for the data-intensive applications of bioinformatics. We present extensible genome processing software for the grid and evaluate its performance. Our software discovered previously unknown circular permutations (CPs) in the ProDom database, which contains more than 70 MB of protein data. A specific feature of our software is its design as a component: the Alignment HOC, a Higher-Order Component that uses the latest Globus Toolkit as grid middleware. Besides genome data, the Alignment HOC accepts as input plugin code for processing this data, and it contains all the configuration required to run the component on top of Globus, thus freeing the non-grid-expert user from dealing with grid middleware. Instead of writing data distribution procedures and configuring the middleware for every new algorithm, Alignment HOC users reuse the existing component and write only application-specific plugins. To store plugins persistently and reusably, we built a web-accessible plugin database with a convenient administration GUI. The flexible component-based implementation makes it easy to study CPs in other databases (e.g., UniProt/Swiss-Prot) or to use an alignment algorithm other than the standard Needleman-Wunsch. For the efficient distribution of workload, we developed a library of group communication operations for HOCs.
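The plugin idea can be illustrated with a minimal Python sketch. It is not the Alignment HOC itself: a local thread pool stands in for the grid middleware, and the names `alignment_hoc`, `AlignmentPlugin`, and `best_rotation_score` are hypothetical. The rotation-based plugin is only one simple way to look for circular permutations, assumed here for illustration, and is not claimed to be the method used in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, Dict, Tuple

# Hypothetical plugin type: any function that scores a pair of sequences.
AlignmentPlugin = Callable[[str, str], int]


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score via the Needleman-Wunsch recurrence (linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_rotation_score(a: str, b: str) -> int:
    """Illustrative CP-style plugin: best score of a against any rotation of b."""
    if not b:
        return needleman_wunsch(a, b)
    return max(needleman_wunsch(a, b[k:] + b[:k]) for k in range(len(b)))


def alignment_hoc(sequences: Dict[str, str], plugin: AlignmentPlugin,
                  workers: int = 4) -> Dict[Tuple[str, str], int]:
    """Higher-order function: applies the supplied plugin to every sequence pair.

    The thread pool is an assumption standing in for distribution over grid nodes.
    """
    pairs = list(combinations(sequences.items(), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda p: plugin(p[0][1], p[1][1]), pairs))
    return {(p[0][0], p[1][0]): s for p, s in zip(pairs, scores)}
```

Swapping the algorithm then means passing a different function, for example `alignment_hoc(seqs, best_rotation_score)` instead of `alignment_hoc(seqs, needleman_wunsch)`, while the distribution code stays unchanged.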
