Tiling and Scheduling of Three-level Perfectly Nested Loops with Dependencies on Heterogeneous Systems

Nested loops are among the most time-consuming parts of many scientific applications and their largest source of parallelism. In this paper, we address the problem of 3-dimensional tiling and scheduling of three-level perfectly nested loops with dependencies on heterogeneous systems. To exploit the parallelism, we tile and schedule the dependent nested loops according to the computational power of the processing nodes and execute the tiles in pipeline mode. The tile size strongly affects the parallel execution time of nested loops. We develop and evaluate a theoretical model that estimates the parallel execution time of tiled nested loops. We also propose a tiling genetic algorithm that uses this model to find a near-optimal tile size, minimizing the parallel execution time of nested loops with dependencies. Several experiments on heterogeneous systems demonstrate the accuracy of the theoretical model and the effectiveness of the proposed tiling genetic algorithm. With the 3D heat equation as a benchmark, 3D tiling reduces the parallel execution time by a factor of 1.2 to 2 compared with 2D tiling.
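
As a rough illustration of what 3D tiling of a dependent loop nest means, the sketch below tiles a three-level perfectly nested loop in all three dimensions and checks that the result matches the untiled loop. It is not the paper's algorithm: the loop body, the uniform dependence vectors (1,0,0), (0,1,0), (0,0,1), the function names `reference` and `tiled`, and the tile sizes `ti`, `tj`, `tk` are all assumptions chosen for illustration. The tiles run sequentially here, and heterogeneity and pipelining are not modeled.

```python
from itertools import product


def reference(n):
    """Untiled three-level nest; each point reads its three lower neighbours."""
    a = [[[1] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            for k in range(1, n):
                a[i][j][k] = a[i - 1][j][k] + a[i][j - 1][k] + a[i][j][k - 1]
    return a


def tiled(n, ti, tj, tk):
    """Same computation, with the iteration space cut into ti x tj x tk tiles.

    All dependence vectors are non-negative, so rectangular tiles are legal
    and visiting the tiles in lexicographic order respects every dependence.
    """
    a = [[[1] * n for _ in range(n)] for _ in range(n)]
    # Outer loops enumerate tile origins; inner loops sweep one tile.
    for ii, jj, kk in product(range(1, n, ti), range(1, n, tj), range(1, n, tk)):
        for i in range(ii, min(ii + ti, n)):
            for j in range(jj, min(jj + tj, n)):
                for k in range(kk, min(kk + tk, n)):
                    a[i][j][k] = a[i - 1][j][k] + a[i][j - 1][k] + a[i][j][k - 1]
    return a


if __name__ == "__main__":
    # Any tile shape should reproduce the untiled result.
    print(tiled(8, 2, 3, 4) == reference(8))
```

In a parallel setting, each tile would be a unit of work assigned to a processing node, and the tile size would trade computation per tile against the communication needed at tile boundaries, which is the quantity a tile-size search would tune.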
