A case study of high-throughput biological data processing on parallel platforms

MOTIVATION: Analyzing large biological data sets on a variety of parallel computer architectures is a common task in bioinformatics. The efficiency of such analyses can be improved significantly by properly handling the redundancy present in the data and by exploiting the unique features of these architectures.

RESULTS: We describe a generalized approach to this analysis and present specific results for CEPAR, an efficient implementation of the Combinatorial Extension (CE) algorithm that runs in a massively parallel (PAR) mode to find pairwise protein structure similarities and align protein structures from the Protein Data Bank. We describe the design and implementation of CEPAR and report the efficiency of the algorithm when run on a large number of processors.

AVAILABILITY: Source code is available by contacting one of the authors.
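As a rough illustration of the two ideas named above (removing redundancy before comparison, then spreading the remaining pairwise comparisons over many processors), the following Python sketch may help. It is not CEPAR's method. The abstract does not say how redundancy is defined or how work is scheduled, so the identical-sequence criterion, the round-robin distribution, the function names `collapse_redundant` and `partition_pairs`, and the toy chain IDs and sequences are all assumptions made for illustration.

```python
from itertools import combinations


def collapse_redundant(chains):
    """Group chains whose sequences are identical (an assumed redundancy
    criterion) and use the first chain ID seen as the representative."""
    groups = {}
    for chain_id, sequence in chains.items():
        groups.setdefault(sequence, []).append(chain_id)
    return {members[0]: members for members in groups.values()}


def partition_pairs(representatives, n_workers):
    """Deal every unordered pair of representatives round-robin into
    one work list per worker (a placeholder scheduling policy)."""
    buckets = [[] for _ in range(n_workers)]
    for i, pair in enumerate(combinations(representatives, 2)):
        buckets[i % n_workers].append(pair)
    return buckets


# Toy input: hypothetical chain IDs mapped to hypothetical sequences.
chains = {
    "1abcA": "MKVL",
    "1abcB": "MKVL",  # duplicate of 1abcA, so it is not compared again
    "2xyzA": "GSHM",
    "3pqrA": "TTAE",
}

clusters = collapse_redundant(chains)
work = partition_pairs(sorted(clusters), n_workers=2)

for worker_id, pairs in enumerate(work):
    print(worker_id, pairs)
```

With four chains, an all-against-all comparison needs six pairs; collapsing the duplicate leaves three representatives and three pairs. Each work list could then be handed to a separate process or node, where a real structure comparison would replace the pair tuples shown here.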
