Towards High-Throughput, Multi-Criteria Protein Structure Comparison and Analysis

Protein Structure Comparison (PSC) is an essential component of biomedical research: it underpins, e.g., drug design, molecular docking, protein folding and structure prediction algorithms, and it is essential to the assessment of these predictions. Each of these applications, like many others in which molecular comparison plays an important role, requires a different notion of similarity, which naturally leads to the Multi-Criteria Protein Structure Comparison (MC-PSC) problem. ProCKSI (www.procksi.org) provides algorithmic solutions for the MC-PSC problem by means of an enhanced structural comparison that relies on the principled application of information fusion to similarity assessments derived from multiple comparison methods. The current MC-PSC implementation works well for moderately sized data sets, but it is time consuming because it provides a public service to multiple users. Many of the structural bioinformatics applications mentioned above would benefit from the ability to perform, for a dedicated user, thousands or tens of thousands of comparisons through multiple methods in real time, a capacity beyond our current technology. In this paper we take a key step in that direction by means of a high-throughput distributed re-implementation of ProCKSI for very large data sets. The core of the proposed framework is an innovative distributed algorithm that runs on each compute node in a cluster/grid environment and performs structure comparison of a given subset of input structures using some of the most popular PSC methods (e.g. USM, MaxCMO, FAST, DaliLite, CE and TM-align). This is followed by a procedure of distributed consensus building. The new algorithms proposed here thus achieve ProCKSI's similarity assessment quality in a fraction of the time it requires.
Our results show that the proposed distributed method can be used efficiently to compare (a) a particular protein against a very large protein structure data set (target-against-all comparison), and (b) a very large data set against itself or against another very large data set (all-against-all comparison). We conclude the paper by enumerating some of the outstanding challenges for real-time MC-PSC.
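
The decomposition described above (enumerate the pairwise comparisons, hand each compute node a subset, then merge the per-method scores into a consensus) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the two scoring functions are toy stand-ins rather than USM, MaxCMO, FAST, DaliLite, CE or TM-align, the nodes are simulated by a plain loop, and the consensus is an unweighted mean rather than ProCKSI's actual information fusion procedure.

```python
from itertools import combinations


def all_against_all(ids):
    """Every unordered pair of structures in one data set."""
    return list(combinations(ids, 2))


def target_against_all(target, ids):
    """One query structure against every other structure."""
    return [(target, other) for other in ids if other != target]


def partition(tasks, n_nodes):
    """Round-robin split of comparison tasks across n_nodes workers."""
    return [tasks[i::n_nodes] for i in range(n_nodes)]


# Toy stand-ins for real PSC methods (assumption: inputs are non-empty
# strings and every score is a similarity in [0, 1]).
def length_ratio(a, b):
    return min(len(a), len(b)) / max(len(a), len(b))


def symbol_overlap(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


TOY_METHODS = {"length_ratio": length_ratio, "symbol_overlap": symbol_overlap}


def run_node(chunk, structures, methods=TOY_METHODS):
    """Work done by one node: apply every method to every pair in its chunk."""
    return {
        (p, q): {name: fn(structures[p], structures[q])
                 for name, fn in methods.items()}
        for p, q in chunk
    }


def build_consensus(node_results):
    """Merge per-node results and average the method scores for each pair."""
    merged = {}
    for result in node_results:
        merged.update(result)
    return {pair: sum(s.values()) / len(s) for pair, s in merged.items()}


if __name__ == "__main__":
    # Hypothetical structures encoded as secondary-structure-like strings.
    structures = {"p1": "HHHEEC", "p2": "HHHEEC", "p3": "EEEE", "p4": "CCHH"}
    tasks = all_against_all(sorted(structures))
    chunks = partition(tasks, n_nodes=3)
    results = [run_node(chunk, structures) for chunk in chunks]
    for pair, score in sorted(build_consensus(results).items()):
        print(pair, round(score, 3))
```

Round-robin partitioning is chosen here only because it is the simplest way to split the task list evenly; it ignores the fact that real comparison methods differ widely in cost per pair, which a practical scheduler on a cluster or grid would need to take into account.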
