Algorithms design for the parallelization of nested loops

The need for parallel processing arises from the existence of time-consuming applications in different areas, such as weather forecasting, nuclear fusion simulations, DNA and protein analysis, and computational fluid dynamics. Parallel processing comprises algorithms, computer architecture, parallel programming and performance analysis. In optimizing the performance of sequential scientific and engineering programs, the greatest gain comes from optimizing nested loops or recursive procedures, where major chunks of computation are performed repeatedly. Nested loops without dependencies between iterations are called DOALL loops, while those with dependencies are called DOACROSS loops. Parallelizing DOACROSS loops is much more challenging than parallelizing DOALL loops, because the existing dependencies between iterations of the loop nest must be satisfied. The challenges that must be addressed for the parallelization of time-consuming applications are: minimizing the total execution time, minimizing the communication time between the processors (especially in the case of DOACROSS loops), balancing the computational load among the processors, dealing with and recovering from failures that may occur either in the program or the system, meeting deadlines, or a combination of these. This doctoral dissertation focuses on parallelizing applications that contain nested DOACROSS loops, while trying to address some of the aforementioned challenges. In particular, it proposes and presents four static methods and three dynamic methods for scheduling nested DOACROSS loops on various architectures. The static scheduling methods were devised for homogeneous systems, while the dynamic scheduling methods were devised for heterogeneous systems or systems with rapidly varying loads. According to the literature, one of the dynamic approaches was the first attempt to parallelize nested DOACROSS loops on heterogeneous systems using a coarse-grained approach and dynamic scheduling.
The proposed algorithms were implemented, verified and evaluated through extensive experiments on various computer system architectures.
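
The distinction between DOALL and DOACROSS loops can be illustrated with a small sketch. The example below is not taken from the dissertation: the function names, the loop bodies and the dependence vectors (1, 0) and (0, 1) are assumptions chosen for illustration, and the wavefront (hyperplane) ordering shown is the classic textbook technique rather than any of the proposed scheduling methods. It shows a loop nest whose iterations are independent, a loop nest with uniform dependencies, and a reordering of the latter in which all iterations on the same anti-diagonal i + j = t could run concurrently.

```python
def doall(n, m):
    """DOALL nest: each iteration uses only its own indices, so any order works."""
    a = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            a[i][j] = i * j
    return a


def doacross_sequential(n, m):
    """DOACROSS nest: iteration (i, j) needs the results of (i-1, j) and (i, j-1)."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def doacross_wavefront(n, m):
    """Same computation, visited one anti-diagonal (wavefront) at a time.

    Iterations with the same t = i + j depend only on wavefront t - 1,
    so the inner loop over `wave` could be distributed among processors.
    """
    a = [[1] * m for _ in range(n)]
    for t in range(2, n + m - 1):  # t = i + j for 1 <= i < n, 1 <= j < m
        lo = max(1, t - m + 1)
        hi = min(n - 1, t - 1)
        wave = [(i, t - i) for i in range(lo, hi + 1)]
        for i, j in wave:  # independent of each other within one wavefront
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

In this sketch the wavefront version executes the inner iterations sequentially. It only demonstrates that the reordering respects the dependencies, which is the property any DOACROSS schedule must preserve.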
