Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud

Exploring 3D protein structures has broad potential applications in medical diagnostics, drug design, and the treatment of patients. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. However, the process is time-consuming and requires substantial computational resources when performed against large repositories. In this paper, we show that 3D protein structure similarity searching can be significantly accelerated by using modern processing techniques and computer architectures. Our experiments demonstrate that by distributing computations on large Hadoop/HBase (HDInsight) clusters and scaling them out and up in the Microsoft Azure public cloud, we can reduce the execution times of similarity search processes from hundreds of hours to minutes. We also show that public clouds are well suited to scientific computations and can be successfully applied when scaling time-consuming computations over large volumes of biological data.
