HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud

Abstract

3D protein structure similarity searching is one of the important processes in structural bioinformatics, since it allows protein function to be identified and phylogeny to be reconstructed for weakly related organisms. Due to the complexity of 3D protein structures and the exponential growth of the number of protein structures in public repositories, such as the Protein Data Bank, the process is time-consuming and requires increased computational resources. Computer systems must therefore be prepared to deal with such huge volumes of macromolecular data. In this paper, we show how 3D protein structure similarity searching can be performed in parallel by distributing MapReduce jobs on an HDInsight cluster in the Microsoft Azure commercial cloud. Our solution combines two important computing paradigms that have gained popularity in recent years: Hadoop/MapReduce and cloud computing. Our experiments, performed on the whole repository of protein structures from the Protein Data Bank, confirm that such a technological fusion is very beneficial and can be successfully applied to time-consuming computations over biological data. Moreover, appropriate preparation of the data reduces the time needed for computations and significantly accelerates the similarity searching.
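
The general idea of splitting a similarity search into map and reduce steps can be illustrated with a minimal, single-machine sketch. This is not the paper's implementation: the function names (`toy_distance`, `map_phase`, `reduce_phase`, `search`), the structure identifiers, and the scoring function are all hypothetical, and a thread pool stands in for the cluster nodes. The score is a placeholder (RMSD over paired coordinates without superposition), not a real structure alignment algorithm.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def toy_distance(a, b):
    """Placeholder score (assumption): RMSD over paired points, no superposition."""
    n = min(len(a), len(b))
    if n == 0:
        return math.inf
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a[:n], b[:n])
    )
    return math.sqrt(total / n)


def map_phase(query, chunk):
    """Map step: score every candidate structure in one chunk against the query."""
    return [(toy_distance(query, coords), struct_id) for struct_id, coords in chunk]


def reduce_phase(mapped, k):
    """Reduce step: merge per-chunk results and keep the k closest structures."""
    merged = [pair for part in mapped for pair in part]
    return heapq.nsmallest(k, merged)


def search(query, database, k=2, n_chunks=2):
    """Split the database into chunks, map them concurrently, then reduce."""
    items = sorted(database.items())
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        mapped = list(pool.map(lambda chunk: map_phase(query, chunk), chunks))
    return reduce_phase(mapped, k)


# Hypothetical toy data: identifiers mapped to lists of (x, y, z) coordinates.
database = {
    "A": [(0, 0, 0), (1, 0, 0)],
    "B": [(0, 0, 0), (1, 0, 1)],
    "C": [(5, 5, 5), (6, 5, 5)],
}
query = [(0, 0, 0), (1, 0, 0)]
result = search(query, database, k=2)
```

Here `result` is a list of `(distance, identifier)` pairs sorted by increasing distance. On a real Hadoop cluster the chunks would be input splits handled by separate map tasks, and the placeholder score would be replaced by an actual structure alignment method.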
