A Proposed Data Partitioning Approach on Heterogeneous HPC Platforms: Data Locality Perspective

We propose a new data partitioning approach to improve the performance of parallel applications on heterogeneous high-performance computing (HPC) systems. Existing approaches overlook an aspect that has a critical impact on the performance of parallel applications: how partitions are assigned to processors so as to minimize communication cost and hence data movement, which dominates both energy and performance costs. Managing data locality in this way is important for a wide range of applications. We therefore propose a data distribution method that takes this aspect into account. Our algorithm seeks to minimize execution time by using two models. The first is a fine-grained computational model of heterogeneous processors, which is accurate enough to produce efficient partitions that maximize utilization. The second is a communication model of heterogeneous processors, which is used to minimize data movement and hide communication overheads. The correctness of our algorithm was analyzed and validated. Its complexity is approximately $O(p \times \log s + p \times s^{2})$, where $s$ is the problem size divided by the step size between data points in the computational model of each processor, and $p$ is the number of heterogeneous processors. The experiments were performed on the AZIZ supercomputer using two types of applications: one with no dependency between its partitions (matrix multiplication) and one with high dependency between its partitions (the Jacobi method). The results show that our algorithm improves performance.
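To make the two ingredients concrete, the following minimal Python sketch splits a one-dimensional workload in proportion to processor speed and then estimates execution time as compute time plus a boundary-exchange cost. It is an illustration only, not the algorithm proposed in the paper: it assumes a constant speed per processor (far coarser than the fine-grained computational model described above), a fixed cost per partition boundary, and a chain of neighbouring partitions as in a row-wise Jacobi decomposition. The names `proportional_partition`, `estimated_time`, and `comm_cost_per_boundary` are hypothetical.

```python
from typing import List


def proportional_partition(n: int, speeds: List[float]) -> List[int]:
    """Split n work units among processors in proportion to their speeds.

    Assumption: each processor has one constant speed (units per second).
    """
    total_speed = sum(speeds)
    ideal = [n * s / total_speed for s in speeds]
    sizes = [int(share) for share in ideal]  # floor of each ideal share
    leftover = n - sum(sizes)
    # Give the remaining units to the largest fractional remainders.
    by_remainder = sorted(
        range(len(speeds)), key=lambda i: ideal[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


def estimated_time(
    sizes: List[int], speeds: List[float], comm_cost_per_boundary: float
) -> float:
    """Estimate parallel time as the slowest processor's compute + exchange time.

    Assumption: partitions form a chain, so interior partitions exchange data
    with two neighbours and the two end partitions with one.
    """
    p = len(sizes)
    times = []
    for i, (size, speed) in enumerate(zip(sizes, speeds)):
        neighbours = (1 if i > 0 else 0) + (1 if i < p - 1 else 0)
        times.append(size / speed + neighbours * comm_cost_per_boundary)
    return max(times)


if __name__ == "__main__":
    speeds = [1.0, 2.0, 4.0]
    sizes = proportional_partition(100, speeds)
    print(sizes, estimated_time(sizes, speeds, comm_cost_per_boundary=0.5))
```

Setting `comm_cost_per_boundary` to zero corresponds to an application with no dependency between partitions, where only the compute term matters. With a non-zero cost, the interior partition pays for two exchanges, which is why a split that balances computation alone is not necessarily the fastest one.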
