Efficient Execution of Dynamic Programming Algorithms on Apache Spark

One of the most important properties of distributed computing systems (e.g., Apache Spark and Apache Hadoop) running on clusters and computation clouds is the ability to scale out by adding more compute nodes. This feature yields a performance gain only if the computation (or the algorithm) itself can scale out; that is, the computation must be easily decomposable into smaller units of work that can be distributed among the workers according to the hardware/software configuration of the cluster or cloud. In addition, such clusters present an important trade-off among communication cost, parallelism, and memory requirement. Because of this scalability requirement and this trade-off, it is crucial to have a well-decomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the trade-off spectrum at which the program executes efficiently on a specific cluster. We design and implement well-decomposable and tunable dynamic programming (DP) algorithms from the Gaussian Elimination Paradigm (GEP), such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multi-way recursive divide-&-conquer algorithms. We explain how to map implementations of those grid-based parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a $2$-$5\times$ speedup of the DP benchmarks.
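To make the notion of a decomposable, tunable GEP computation concrete, the following is a minimal single-machine sketch of a tiled Floyd-Warshall in NumPy. It is not the Spark implementation described above: the names `update_tile` and `blocked_floyd_warshall` and the `block` parameter are hypothetical, and the code uses the classic blocked (grid-based) formulation rather than the parametric multi-way recursive kernels. It assumes a zero diagonal and no negative cycles. Each call to `update_tile` is an independent unit of work within its phase, and `block` is the tuning knob that trades the number of tiles (parallelism) against tile size (memory per task).

```python
import numpy as np


def update_tile(X, U, V):
    """GEP-style min-plus update of tile X in place, using tiles U and V.

    For every k in the shared dimension:
        X[i, j] = min(X[i, j], U[i, k] + V[k, j])
    U and V may alias X (diagonal, row, and column tiles).
    """
    for k in range(U.shape[1]):
        # The sum is materialized as a temporary before X is overwritten.
        np.minimum(X, U[:, k:k + 1] + V[k:k + 1, :], out=X)


def blocked_floyd_warshall(dist, block):
    """All-pairs shortest paths on a dense distance matrix, tile by tile.

    `block` is the tile edge length (an illustrative tuning parameter).
    Assumes dist[i][i] == 0 and no negative cycles.
    """
    D = np.array(dist, dtype=float)  # work on a copy
    n = D.shape[0]
    starts = range(0, n, block)

    def tile(i, j):
        # NumPy slices are views, so updates write through to D.
        return D[i:i + block, j:j + block]

    for k in starts:
        # Phase 1: the diagonal tile depends only on itself.
        update_tile(tile(k, k), tile(k, k), tile(k, k))
        # Phase 2: tiles in the same tile row and tile column.
        for j in starts:
            if j != k:
                update_tile(tile(k, j), tile(k, k), tile(k, j))
        for i in starts:
            if i != k:
                update_tile(tile(i, k), tile(i, k), tile(k, k))
        # Phase 3: all remaining tiles; these are mutually independent.
        for i in starts:
            for j in starts:
                if i != k and j != k:
                    update_tile(tile(i, j), tile(i, k), tile(k, j))
    return D
```

In a distributed setting, the tiles updated within one phase could in principle be assigned to different workers, with the diagonal, row, and column tiles communicated between phases; how that mapping is done on Spark is the subject of the paper and is not shown here.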


[46]  Paul Feautrier,et al.  Dataflow analysis of array and scalar references , 1991, International Journal of Parallel Programming.

[47]  J. Ramanujam,et al.  Parameterized tiling revisited , 2010, CGO '10.

[48]  Robert Giegerich,et al.  GPU Parallelization of Algebraic Dynamic Programming , 2009, PPAM.

[49]  Jop F. Sibeyn External matrix multiplication and all-pairs shortest path , 2004, Inf. Process. Lett..

[50]  Ken Kennedy,et al.  Transforming loops to recursion for multi-level memory hierarchies , 2000, PLDI '00.

[51]  Andrew W. Moore,et al.  Finding optimal Bayesian networks by dynamic programming , 2005 .

[52]  Alfred V. Aho,et al.  The Design and Analysis of Computer Algorithms , 1974 .

[53]  Vangelis Th. Paschos Concepts of combinatorial optimization , 2014 .

[54]  Sartaj Sahni,et al.  A blocked all-pairs shortest-paths algorithm , 2003, ACM J. Exp. Algorithmics.

[55]  James G. Shanahan,et al.  Large Scale Distributed Data Science using Apache Spark , 2015, KDD.

[56]  Kevin P. Murphy,et al.  Bayesian structure learning using dynamic programming and MCMC , 2007, UAI.

[57]  Louis-Noël Pouchet,et al.  Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilers , 2020, CGO.

[58]  George Karypis,et al.  Introduction to Parallel Computing , 1994 .

[59]  Barbara M. Chapman,et al.  A Comparative Survey of the HPC and Big Data Paradigms: Analysis and Experiments , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[60]  Ben D. Lund,et al.  A Multi-Stage CUDA Kernel for Floyd-Warshall , 2010, ArXiv.

[61]  Zheguang Zhao,et al.  Bridging the Gap between HPC and Big Data frameworks , 2017, Proc. VLDB Endow..

[62]  Barbara Chapman,et al.  Using OpenMP - portable shared memory parallel programming , 2007, Scientific and engineering computation.

[63]  Giuseppe Coviello,et al.  COSMIC: middleware for high performance and reliable multiprocessing on xeon phi coprocessors , 2013, HPDC '13.

[64]  Silvio Lattanzi,et al.  Filtering: a method for solving graph problems in MapReduce , 2011, SPAA '11.

[65]  Michael A. Bender,et al.  Cache-Adaptive Algorithms , 2014, SODA.

[66]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[67]  Weiguo Liu,et al.  Streaming Algorithms for Biological Sequence Alignment on GPUs , 2007, IEEE Transactions on Parallel and Distributed Systems.

[68]  Roger Wattenhofer,et al.  Optimal distributed all pairs shortest paths and applications , 2012, PODC '12.

[69]  Joseph T. Kider,et al.  All-pairs shortest-paths for large graphs on the GPU , 2008, GH '08.

[70]  Siu Kwan Lam,et al.  Numba: a LLVM-based Python JIT compiler , 2015, LLVM '15.

[71]  Joshua Zhexue Huang,et al.  Big data analytics on Apache Spark , 2016, International Journal of Data Science and Analytics.

[72]  Bo Xu,et al.  Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark , 2017, 2017 IEEE 10th International Conference on Cloud Computing (CLOUD).

[73]  Dominique Lavenier,et al.  GPU Accelerated RNA Folding Algorithm , 2009, ICCS.

[74]  Frank Mueller,et al.  SparkScore: Leveraging Apache Spark for Distributed Genomic Inference , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[75]  K. B. Haley,et al.  Optimization Theory with Applications , 1970 .

[76]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[77]  Lorenzo Ridi,et al.  Developing a Scheduler with Difference-Bound Matrices and the Floyd-Warshall Algorithm , 2012, IEEE Software.

[78]  Reynold Xin,et al.  GraphX: a resilient distributed graph system on Spark , 2013, GRADES.

[79]  Armando Solar-Lezama,et al.  AUTOGEN: automatic discovery of cache-oblivious parallel recursive algorithms for solving dynamic programs , 2016, PPOPP.

[80]  Ali Akoglu,et al.  Sequence alignment with GPU: Performance and design challenges , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[81]  Wlodzimierz Bielecki,et al.  Using basis dependence distance vectors in the modified Floyd–Warshall algorithm , 2015, J. Comb. Optim..

[82]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[83]  Anu Pradhan,et al.  Finding All-Pairs Shortest Path for a Large-Scale Transportation Network Using Parallel Floyd-Warshall and Parallel Dijkstra Algorithms , 2013, J. Comput. Civ. Eng..

[84]  Parimala Thulasiraman,et al.  Performance study of mapping irregular computations on GPUs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[85]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[86]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[87]  Judy Qiu,et al.  A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures , 2014, 2014 IEEE International Congress on Big Data.

[88]  Bhavani M. Thuraisingham,et al.  Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce , 2009, CloudCom.

[89]  Isaac Woungang,et al.  Modified Floyd-Warshall algorithm for equal cost multipath in software-defined data center , 2015, 2015 IEEE International Conference on Communication Workshop (ICCW).

[90]  Vipin Kumar,et al.  Scalability of Parallel Algorithms for the All-Pairs Shortest-Path Problem , 1991, J. Parallel Distributed Comput..

[91]  Vijaya Ramachandran,et al.  The cache-oblivious gaussian elimination paradigm: theoretical framework and experimental evaluation , 2006, SPAA '06.

[92]  Niladri Chakraborty,et al.  Modification of Floyd-Warshall's algorithm for Shortest Path routing in wireless sensor networks , 2014, 2014 Annual IEEE India Conference (INDICON).

[93]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.