Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment

Calculating structural features of proteins, nucleic acids, and nucleic acid-protein complexes from their geometries, and studying the interactions within these macromolecules, whose high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extracting suitable data from text files. To perform these operations at large scale as the amount of macromolecular data in public repositories grows, we propose running them in the distributed environment of Azure Data Lake and scaling the calculations in the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acid structures in the Azure Data Lake. Our tests show that the Cloud storage space occupied by the macromolecular data can be reduced by compressing PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the calculations can be significantly accelerated by storing macromolecular data in large sequential files and by parallelizing both the calculations and the data extractions that precede them. Finally, the paper shows how all the calculations can be performed declaratively in U-SQL scripts for Data Lake Analytics.
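To make the extraction step concrete, the following is a minimal Python sketch of reading atom records from a gzip-compressed PDB file. It is not the U-SQL extractors presented in the paper. It only illustrates the kind of fixed-column parsing such an extractor has to perform, using the column layout of ATOM and HETATM records from the PDB file format. The names `Atom`, `extract_atoms`, and `extract_atoms_from_gzip` are hypothetical and chosen for illustration only.

```python
import gzip
import io
from typing import Iterable, Iterator, List, NamedTuple


class Atom(NamedTuple):
    """Subset of the fields carried by a PDB ATOM/HETATM record (hypothetical type)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def extract_atoms(lines: Iterable[str]) -> Iterator[Atom]:
    """Yield one Atom per ATOM/HETATM line, slicing the fixed PDB columns."""
    for line in lines:
        # Columns 1-6 hold the record name; all other record types are skipped.
        if line[0:6] not in ("ATOM  ", "HETATM"):
            continue
        yield Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain_id=line[21],             # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        )


def extract_atoms_from_gzip(raw: bytes) -> List[Atom]:
    """Decompress gzip-compressed PDB text held in memory and parse its atoms."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="ascii") as handle:
        return list(extract_atoms(handle))
```

Because `extract_atoms` accepts any iterable of lines, the same function can be applied to an uncompressed file, a decompressed stream, or one slice of a large sequential file containing many structures. This sketch assumes that kind of separation between reading and parsing; the paper's extractors may be organized differently.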
