The application of Hadoop in structural bioinformatics

This paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework for analysing large fractions of the Protein Data Bank, which is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of Hadoop implementations of high-throughput analyses reported in the literature, together with their scalability. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. The reported scalability varies in comparison with other batch schedulers, and it is difficult to assess because direct comparisons of Hadoop with batch schedulers on the same platform are generally not available in the literature. We note, however, that there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of its interface and of configuring a resource to run Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, as usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases, and as standardised approaches such as workflow languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
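The pattern noted above, in which an existing executable is called from MapReduce rather than the algorithm being rewritten, can be illustrated with a minimal sketch of a Hadoop Streaming style mapper. Streaming mappers read one record per line from standard input and write tab-separated key/value lines to standard output. Everything named here is an assumption made for illustration and is not taken from any of the reviewed implementations: the function names `run_executable`, `map_records` and `main`, the placeholder command `echo`, and the idea that each input line holds a single structure identifier.

```python
import subprocess
import sys
from typing import Callable, Iterable, Iterator, Sequence

# Hypothetical placeholder command. A real deployment would name a docking or
# alignment binary (and its arguments) here instead of `echo`.
DEFAULT_COMMAND: Sequence[str] = ("echo",)


def run_executable(record: str, command: Sequence[str] = DEFAULT_COMMAND) -> str:
    """Run the wrapped executable on one record and return its output on one line."""
    result = subprocess.run(
        [*command, record], capture_output=True, text=True, check=True
    )
    # Collapse all whitespace so the value never spans several output lines.
    return " ".join(result.stdout.split())


def map_records(
    lines: Iterable[str], run_tool: Callable[[str], str] = run_executable
) -> Iterator[str]:
    """Yield one 'key<TAB>value' line per non-empty input record."""
    for line in lines:
        record = line.strip()
        if not record:
            continue  # skip blank lines in the input split
        yield f"{record}\t{run_tool(record)}"


def main() -> None:
    """Entry point for a streaming task: stdin records in, key/value lines out."""
    for output_line in map_records(sys.stdin):
        print(output_line)
```

In a streaming job this script would end with a call to `main()` and be supplied as the mapper, with the number of reducers set to zero when a map-only job is wanted. Keeping the subprocess call behind the `run_tool` parameter means the mapper logic can be checked locally without a cluster or the real binary.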

[1]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[2]  Marek S. Wiewiórka,et al.  SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision , 2014, Bioinform..

[3]  David T. Jones,et al.  Improving the accuracy of transmembrane protein topology prediction using evolutionary information , 2007, Bioinform..

[4]  Shaoliang Peng,et al.  Bioinformatics applications on Apache Spark , 2018, GigaScience.

[5]  Philip E. Bourne,et al.  The RCSB PDB information portal for structural genomics , 2005, Nucleic Acids Res..

[6]  Ivan Merelli,et al.  Clustering Protein Structures with Hadoop , 2015, CIBB.

[7]  Trilce Estrada,et al.  Automatic selection of near-native protein-ligand conformations using a hierarchical clustering and volunteer computing , 2010, BCB '10.

[8]  Ruth Nussinov,et al.  An overview of recent advances in structural bioinformatics of protein-protein interactions and a guide to their principles. , 2014, Progress in biophysics and molecular biology.

[9]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[10]  Hanan Samet,et al.  An Overview of Quadtrees, Octrees, and Related Hierarchical Data Structures , 1988 .

[11]  Dariusz Mrozek,et al.  Cloud4Psi: cloud computing for 3D protein structure similarity searching , 2014, Bioinform..

[12]  Michael C. Schatz,et al.  CloudBurst: highly sensitive read mapping with MapReduce , 2009, Bioinform..

[13]  A. Konagurthu,et al.  MUSTANG: A multiple structural alignment algorithm , 2006, Proteins.

[14]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[15]  Anthony Skjellum,et al.  Using MPI - portable parallel programming with the message-parsing interface , 1994 .

[16]  Dusanka Janezic,et al.  ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment , 2010, Bioinform..

[17]  M. Rawlins Cutting the cost of drug development? , 2004, Nature Reviews Drug Discovery.

[18]  Judy Qiu,et al.  Proceedings of the second international workshop on Emerging computational methods for the life sciences , 2011, HPDC 2011.

[19]  Andreas Prlic,et al.  Sequence analysis , 2003 .

[20]  L Nelson Michael,et al.  A Comparison of Queueing, Cluster and Distributed Computing Systems , 1994 .

[21]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[22]  Laura M. Jackson,et al.  Finding Our Way through Phenotypes , 2015, PLoS biology.

[23]  Huanming Yang,et al.  SNP detection for massively parallel whole-genome resequencing. , 2009, Genome research.

[24]  Michael Q. Zhang,et al.  Using quality scores and longer reads improves accuracy of Solexa read mapping , 2008, BMC Bioinformatics.

[25]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[26]  E J Dodson,et al.  Determination and restrained least-squares refinement of the structures of ribonuclease Sa and its complex with 3'-guanylic acid at 1.8 A resolution. , 1991, Acta crystallographica. Section B, Structural science.

[27]  David C. Jones,et al.  CATH--a hierarchic classification of protein domain structures. , 1997, Structure.

[28]  Eija Korpelainen,et al.  Hadoop-BAM: directly manipulating next generation sequencing data in the cloud , 2012, Bioinform..

[29]  David G. Messerschmitt,et al.  Software Ecosystem: Understanding an Indispensable Technology and Industry , 2003 .

[30]  Yee Siew Choong,et al.  Minireview: Applied Structural Bioinformatics in Proteomics , 2013, The Protein Journal.

[31]  Sally R. Ellingson,et al.  High-throughput virtual molecular docking: Hadoop implementation of AutoDock4 on a private cloud , 2011, ECMLS '11.

[32]  Andreas Prlic,et al.  MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures , 2017, PLoS Comput. Biol..

[33]  Andreas Prlic,et al.  BioJava: an open-source framework for bioinformatics in 2012 , 2012, Bioinform..

[34]  F. Allen The Cambridge Structural Database: a quarter of a million crystal structures and rising. , 2002, Acta crystallographica. Section B, Structural science.

[35]  Michael Darsow,et al.  ChEBI: a database and ontology for chemical entities of biological interest , 2007, Nucleic Acids Res..

[36]  James G. Shanahan,et al.  Large Scale Distributed Data Science using Apache Spark , 2015, KDD.

[37]  Michael Isard,et al.  Scalability! But at what COST? , 2015, HotOS.

[38]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[39]  Saba Latif,et al.  A survey on Protein Protein Interactions (PPI) methods, databases, challenges and future directions , 2018, 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET).

[40]  J. Irwin,et al.  Benchmarking sets for molecular docking. , 2006, Journal of medicinal chemistry.

[41]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[42]  Dariusz Mrozek,et al.  High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model , 2018, Knowledge and Information Systems.

[43]  Che-Lun Hung,et al.  Cloud Computing for Protein-Ligand Binding Site Comparison , 2013, BioMed research international.

[44]  Adam Godzik,et al.  Flexible structure alignment by chaining aligned fragment pairs allowing twists , 2003, ECCB.

[45]  M. Mezei,et al.  Molecular docking: a powerful approach for structure-based drug discovery. , 2011, Current computer-aided drug design.

[46]  Paolo Di Tommaso,et al.  Nextflow enables reproducible computational workflows , 2017, Nature Biotechnology.

[47]  Liisa Holm,et al.  Dali server: conservation mapping in 3D , 2010, Nucleic Acids Res..

[48]  Zhao Zhang,et al.  Rethinking Data-Intensive Science Using Scalable Analytics Systems , 2015, SIGMOD Conference.

[49]  Jeremy Leipzig,et al.  A review of bioinformatic pipeline frameworks , 2016, Briefings Bioinform..

[50]  Philip E. Bourne,et al.  A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites , 2007, BMC Bioinformatics.

[51]  Daozheng Chen,et al.  Predicting Protein Ligand Binding Sites with Structure Alignment Method on Hadoop , 2016 .

[52]  Xiaohua Zhang,et al.  Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines , 2013, J. Comput. Chem..

[53]  Yaw-Ling Lin,et al.  Implementation of a Parallel Protein Structure Alignment Service on Cloud , 2013, International journal of genomics.

[54]  M. Schatz,et al.  Big Data: Astronomical or Genomical? , 2015, PLoS biology.

[55]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[56]  Hugh P. Shanahan,et al.  Bioinformatics on the Cloud Computing Platform Azure , 2014, PloS one.

[57]  Kevin Bryson,et al.  Computer-assisted protein domain boundary prediction using the DomPred server. , 2007, Current protein & peptide science.

[58]  P E Bourne,et al.  Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. , 1998, Protein engineering.

[59]  David A. Agard,et al.  Structural characterization of a subtype-selective ligand reveals a novel mode of estrogen receptor antagonism , 2002, Nature Structural Biology.

[60]  Bernard F. Buxton,et al.  The DISOPRED server for the prediction of protein disorder , 2004, Bioinform..

[61]  Geoffrey C. Fox,et al.  MapReduce in the Clouds for Science , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[62]  Robert Schmieder,et al.  Big data challenges and opportunities in high-throughput sequencing , 2013 .

[63]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[64]  Barry Honig,et al.  Structural bioinformatics of the interactome. , 2014, Annual review of biophysics.

[65]  Chris Sander,et al.  Touring protein fold space with Dali/FSSP , 1998, Nucleic Acids Res..

[66]  José A. B. Fortes,et al.  CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications , 2008, 2008 IEEE Fourth International Conference on eScience.

[67]  J. S. Sodhi,et al.  Predicting metal-binding site residues in low-resolution structural models. , 2004, Journal of molecular biology.

[68]  Daniel W. A. Buchan,et al.  Scalable web services for the PSIPRED Protein Analysis Workbench , 2013, Nucleic Acids Res..

[69]  E Ray Dorsey,et al.  Financial anatomy of biomedical research. , 2005, JAMA.

[70]  Xian-He Sun,et al.  Performance comparison under failures of MPI and MapReduce: An analytical approach , 2013, Future Gener. Comput. Syst..

[71]  Trilce Estrada,et al.  A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach , 2012, Comput. Biol. Medicine.

[72]  Yanli Wang,et al.  PubChem: a public information system for analyzing bioactivities of small molecules , 2009, Nucleic Acids Res..

[73]  G. Morris,et al.  Molecular docking. , 2008, Methods in molecular biology.

[74]  Jerrold L. Wagener High performance fortran , 1996, Comput. Stand. Interfaces.

[75]  Timothy Nugent,et al.  Membrane protein structural bioinformatics. , 2012, Journal of structural biology.

[76]  Hans Briem,et al.  A crystallographic fragment screen identifies cinnamic acid derivatives as starting points for potent Pim-1 inhibitors. , 2011, Acta crystallographica. Section D, Biological crystallography.

[77]  Andrew E. Torda,et al.  The GROMOS biomolecular simulation program package , 1999 .

[78]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[79]  M. Karplus,et al.  CHARMM: A program for macromolecular energy, minimization, and dynamics calculations , 1983 .

[80]  Marta Mattoso,et al.  Exploring Large Scale Receptor-Ligand Pairs in Molecular Docking Workflows in HPC Clouds , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.

[81]  B. Langmead,et al.  Cloud-scale RNA-sequencing differential expression analysis with Myrna , 2010, Genome Biology.

[82]  Geoffrey C. Fox,et al.  Investigation of Data Locality in MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[83]  Liam J. McGuffin,et al.  The PSIPRED protein structure prediction server , 2000, Bioinform..

[84]  Christine A. Orengo,et al.  FFPred: an integrated feature-based function prediction server for vertebrate proteomes , 2008, Nucleic Acids Res..

[85]  Nathan Linial,et al.  Approximate protein structural alignment in polynomial time. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[86]  J F Gibrat,et al.  Surprising similarities in structure comparison. , 1996, Current opinion in structural biology.

[87]  Antony J. Williams,et al.  ChemSpider:: An Online Chemical Information Resource , 2010 .

[88]  Weisong Shi,et al.  CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping , 2011, BMC Research Notes.

[89]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[90]  David E. Culler,et al.  User-Centric Performance Analysis of Market-Based Cluster Batch Schedulers , 2002, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02).

[91]  Fabian A. Buske,et al.  VariantSpark: population scale clustering of genotype information , 2015, BMC Genomics.

[92]  W R Taylor,et al.  SSAP: sequential structure alignment program for protein structure comparison. , 1996, Methods in enzymology.

[93]  Ronald C. Taylor An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics , 2010, BMC Bioinformatics.

[94]  Lars George,et al.  HBase - The Definitive Guide: Random Access to Your Planet-Size Data , 2011 .

[95]  R. Nussinov,et al.  Protein–protein interactions: Structurally conserved residues distinguish between binding sites and exposed protein surfaces , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[96]  J L Sussman,et al.  Protein Data Bank archives of three-dimensional macromolecular structures. , 1997, Methods in enzymology.