Hadoop‐based replica exchange over heterogeneous distributed cyberinfrastructures

We present Hadoop‐based replica exchange (HaRE), an implementation of the replica exchange scheme on Hadoop, developed primarily for replica exchange statistical temperature molecular dynamics (RESTMD), an example of a large‐scale, advanced sampling molecular dynamics simulation. By using Hadoop as the framework and the MapReduce model to drive replica exchange, HaRE introduces efficient task‐level parallelism to RESTMD simulations. To demonstrate this, we investigate the performance of our application over various distributed cyberinfrastructures (DCI), including several high‐performance computing (HPC) systems, our Cyberinfrastructure for Reconfigurable Optical Networks (CRON) testbed, the Global Environment for Network Innovations (GENI) testbed, and the CloudLab testbed. We analyze scalability in terms of scale‐out and scale‐up over a single HPC cluster, EC2, and CloudLab, and in terms of scale‐across with CRON and GENI. The results demonstrate that HaRE executes efficiently over both homogeneous and heterogeneous DCI of varying size and configuration. We discuss the factors that contribute to performance in order to provide insight into the effects of the computing environment on the execution of HaRE. Based on these contributions, we propose that similar loosely coupled scientific applications can also take advantage of the scalable, task‐level parallelism that Hadoop MapReduce provides over various DCI. Copyright © 2016 John Wiley & Sons, Ltd.
