Task-parallel Analysis of Molecular Dynamics Trajectories

The HPC and Big Data communities have proposed different parallel frameworks for implementing data analysis applications. In this paper, we investigate three task-parallel frameworks, Spark, Dask and RADICAL-Pilot, with respect to their ability to support data analytics on HPC resources, and compare them to MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations, which are significant consumers of supercomputing cycles and produce immense amounts of data. A typical large-scale MD simulation of a physical system of O(100k) atoms over microseconds can produce from O(10) GB to O(1000) GB of data. We propose and evaluate different approaches for parallelizing a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask and RADICAL-Pilot with respect to their abstractions and runtime engine capabilities to support these algorithms. We provide a conceptual basis for comparing and understanding the different frameworks that enables users to select the optimal system for each application. We also provide a quantitative performance analysis of the different algorithms across the three frameworks.
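To make the task-parallel structure of path similarity concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes the Hausdorff distance as the path metric and uses Python's standard-library thread pool as a stand-in for a task-parallel framework. The function names (`directed_hausdorff`, `hausdorff`, `pairwise_hausdorff`) and the toy two-dimensional "frames" are illustrative assumptions; real trajectories would hold per-frame atomic coordinates and would typically use a structural distance such as RMSD between frames.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import dist


def directed_hausdorff(path_a, path_b):
    """Largest distance from any frame of path_a to its nearest frame in path_b."""
    return max(min(dist(p, q) for q in path_b) for p in path_a)


def hausdorff(path_a, path_b):
    """Symmetric Hausdorff distance between two paths."""
    return max(directed_hausdorff(path_a, path_b),
               directed_hausdorff(path_b, path_a))


def pairwise_hausdorff(paths, max_workers=4):
    """Compute all pairwise distances, submitting one independent task per pair."""
    pairs = list(combinations(range(len(paths)), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = pool.map(lambda ij: hausdorff(paths[ij[0]], paths[ij[1]]), pairs)
        return dict(zip(pairs, values))


# Toy "trajectories" (assumed data): each path is a list of frames,
# and each frame is a flat tuple of coordinates.
paths = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)],
]
distances = pairwise_hausdorff(paths)
```

Each pair of paths is an independent task with no communication between tasks, which is the property that lets this kind of analysis map onto task-parallel frameworks. The thread pool here only illustrates the decomposition; it is not meant to deliver real speedup for pure-Python arithmetic.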
