Time-Optimal and Conflict-Free Mappings of Uniform Dependence Algorithms into Lower Dimensional Processor Arrays

Most existing methods of mapping algorithms into processor arrays are restricted to the case where n-dimensional algorithms, or algorithms with n nested loops, are mapped into (n-1)-dimensional arrays. In practice, however, it is interesting to map n-dimensional algorithms into (k-1)-dimensional arrays where k ≤ n. For example, many algorithms at the bit level are at least 4-dimensional (matrix multiplication, convolution, LU decomposition, etc.) and most existing bit-level processor arrays are 2-dimensional. A computational conflict occurs if two or more computations of an algorithm are mapped into the same processor and the same execution time. In this paper, necessary and sufficient conditions are derived to identify all mappings without computational conflicts, based on the Hermite normal form of the mapping matrix. These conditions are used to propose methods of mapping any n-dimensional algorithm into a (k-1)-dimensional array, k ≤ n, without computational conflicts. When k ≥ n-3, optimality of the mapping is guaranteed.

This research was supported in part by the National Science Foundation under Grant DCI-8419745 and in part by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization, and was administered through the Office of Naval Research under contracts No. 00014-85-K-0588 and No. 00014-88-K-0723.

List of Symbols

B: (n-1)x(n-1) matrix; T = [B, b].
b: (n-1)x1 column vector; T = [B, b].
D: dependence matrix; Definition 2.1.
det B: determinant of matrix B.
d_j: dependence (column) vector with n components; Definition 2.1(4).
H: the Hermite normal form of mapping matrix T; Theorem 4.1.
I: identity matrix.
J: index set; Definition 2.1(1).
j: column vector; index point; Definition 2.1(1).
k: number of rows of mapping matrix T; Definition 2.2.
m: number of dependence vectors in D; Definition 2.1(4).
N: set of non-negative integers.
N+: set of positive integers.
n: algorithm dimension or number of entries of index points in J; Definition 2.1(1).
rank(A): rank of matrix A.
S: space mapping matrix; Definition 2.2.
s_ij: entry of S at ith row and jth column.
T: mapping matrix; Definition 2.2.
U: multiplier of the Hermite normal form; Theorem 4.1.
u_ij: entry of U at ith row and jth column.
V: inverse of U; Theorem 4.1.
v_ij: entry of V at ith row and jth column.
Z: set of integers.
Π: row vector; linear schedule vector; Definition 2.2.
Π*: row vector; optimal solution of Problem 2.2.
π_i: ith entry of Π.
γ: conflict vector; Definition 2.3.
γ_i: ith entry of γ.
μ_i: the ith upper bound of J; Equation 2.5.
τ: linear mapping of algorithms into arrays; Definition 2.2.
∅: empty set.
0: a column or row vector whose entries are all 0.
|C|: cardinality of set C.
|a|: absolute value of a.

1. INTRODUCTION

Most existing methods of mapping algorithms into processor arrays are restricted to the cases where n-dimensional algorithms, or algorithms with n nested loops, are mapped into (n-1)-dimensional processor arrays [2-13]. For example, the 3-dimensional matrix multiplication algorithm is usually mapped into a 2-dimensional processor array by these methods [10], [21], [32]. This paper considers mappings of n-dimensional algorithms into (k-1)-dimensional, k ≤ n, processor arrays. Procedures are proposed to find mappings without computational conflicts, which means that no two or more computations of the algorithm are mapped into the same processor and execution time. When k ≥ n-3, these mappings are time-optimal.
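To make the notion of a computational conflict concrete before the formal treatment, the following sketch is illustrative only: the 2x3 mapping matrix (a schedule row Π and a space row S), the loop bounds, and the index set are hypothetical choices, not data taken from the paper.

```python
# Minimal sketch (hypothetical data): a 3-D index set mapped by a 2x3 integer
# matrix T whose first row is the linear schedule PI (time) and whose second
# row is the space mapping S (processor of a 1-D array).  A computational
# conflict is a pair of distinct index points that receive the same
# (time, processor) image under T.
from itertools import product

PI = (1, 1, 1)          # schedule row: execution time of index point j is PI . j
S  = (1, 0, -1)         # space row: processor assigned to j is S . j
N  = 4                  # hypothetical upper bound of each loop index

def image(j):
    """Apply T = [PI; S] to index point j, giving (time, processor)."""
    return (sum(p * x for p, x in zip(PI, j)),
            sum(s * x for s, x in zip(S, j)))

seen = {}
for j in product(range(N), repeat=3):
    t_p = image(j)
    if t_p in seen:                  # same time step AND same processor
        print("conflict:", seen[t_p], "and", j, "both map to", t_p)
        break
    seen[t_p] = j
```

With these hypothetical choices, the index points (0,2,0) and (1,0,1) are both assigned to processor 0 at time step 2, so this particular T would have to be rejected; the conditions derived in Sections 3 and 4 characterize such conflicts analytically instead of by enumeration.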
In simple terms, the algorithms under consideration in this paper are called uniform dependence algorithms and can model nested loop algorithms. They are represented as partially ordered subsets of a multidimensional integer lattice (called index sets). The points of this lattice correspond to (or index) computations, and the partial order reflects the data dependencies between these computations. These data dependencies are represented as vectors that connect points of the lattice. Informally, a dependence is said to be uniform if it is present between every pair of lattice points whose vector difference equals the dependence vector. If all dependencies are uniform, then the algorithm is said to be a uniform dependence algorithm. This algorithm model can be easily related to similar models and concepts in [1-13], [19] and several other works.

Examples of 2-dimensional bit-level processor arrays include GAPP [33], DAP [34], MPP [35], the Connection Machine [31], etc. Many bit-level algorithms are four- or five-dimensional, such as matrix multiplication, convolution, LU decomposition, etc. How to automatically map these algorithms into 2-dimensional bit-level arrays is still a problem [28]. That is why, in practice, it is interesting to develop a method to map n-dimensional algorithms into (k-1)-dimensional processor arrays with k ≤ n.

This work was motivated by the implementation of RAB (Reconfiguration Algorithm for Bit level code) [26], an experimental tool which maps a class of algorithms programmed in 'C' into bit-level arrays. In this approach, algorithms are first expanded into bit-level algorithms; second, the dependence relations are analyzed and the algorithm is uniformized. Then the globally optimal solution, which often maps a four- or five-dimensional bit-level algorithm into a 2-dimensional bit-level processor array, is to be found.

Several attempts have been made to map algorithms into lower dimensional systolic arrays [15], [22], [23], [25]. In particular, important steps towards a formal solution to this problem were made in [23]. Based on the Lamport hyperplane transformation model [13], a procedure was proposed to find mappings of 3-dimensional algorithms into 1-dimensional or linear systolic arrays without computational conflicts and data link collisions. Five conditions were given to guarantee the correctness of the mapping. The first condition ensures that dependence relations among different computations of the algorithm are respected; the second condition is about computational conflicts; the third and fourth conditions deal with the number of shift registers on links and the data travel directions; and the fifth condition is to avoid data link collisions. The concept of data link collisions and the conditions to avoid such collisions were introduced in that work. Computational conflicts are detected essentially by inspecting all computations of the algorithm, and the optimality of the mapping is not guaranteed. In [22], further results are reported on mapping n-dimensional algorithms into (k-1)-dimensional processor arrays. A suboptimal solution for the reindexed transitive closure algorithm [17], [23] was found by the procedure proposed in [22], by which the total execution time is μ(2μ+3)+1, where μ is the problem size.

This paper describes a method of mapping n-dimensional uniform dependence algorithms into (k-1)-dimensional arrays, k ≤ n, without any computational conflicts.
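To make the uniform dependence model concrete, here is a minimal sketch built around the familiar 3-nested-loop matrix multiplication; the dependence vectors listed are the standard ones for a localized formulation of this algorithm and are used here only as an illustration, not as an example reproduced from later sections of the paper.

```python
# Minimal sketch (assumed formulation): in the 3-nested-loop matrix product
#   c[i][j] = sum over k of a[i][k] * b[k][j],
# the computation at index point (i, j, k) accumulates onto the partial sum
# produced at (i, j, k-1) and, in a localized version, reuses a[i][k] from
# (i, j-1, k) and b[k][j] from (i-1, j, k).  The dependence vectors are
# therefore constant over the whole index set, i.e. the dependences are uniform.
D = [
    (0, 0, 1),   # partial sum of c[i][j] arrives from (i, j, k-1)
    (0, 1, 0),   # a[i][k] is reused from (i, j-1, k)
    (1, 0, 0),   # b[k][j] is reused from (i-1, j, k)
]

def sources(p):
    """Index points whose results or operands computation p depends on."""
    return [tuple(x - d for x, d in zip(p, vec)) for vec in D]

# The difference between p and each of its sources is the same dependence
# vector for every index point p, which is exactly the uniformity property.
for p in [(1, 1, 1), (2, 3, 1), (3, 2, 2)]:
    print(p, "->", sources(p))
```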
Based on the Hermite normal form of the mapping matrix, simple and easy-to-use necessary and sufficient conditions are derived to guarantee a conflict-free mapping. These conditions are used to formulate the problem of finding time-optimal and conflict-free mappings as an integer programming problem. Optimality is always guaranteed for the mapping of n-dimensional algorithms into (n-j)-dimensional, j = 1, ..., 4, processor arrays. Compared to the methods in [22] and [23], the main contribution of this paper is the easy-to-use and closed-form necessary and sufficient conditions for conflict-free mappings. In addition, based on these conditions, this paper formulates the problem of identifying time-optimal and conflict-free mappings as an integer programming optimization problem. For some algorithms, such as the matrix multiplication algorithm and the transitive closure algorithm, the integer programming formulation can be further converted to linear programming problems. In Section 5, the method proposed in this paper is used to find the optimal solution for the reindexed transitive closure algorithm, which improves the total execution time of μ(2μ+3)+1 in [22] to μ(μ+3)+1.

This paper is organized as follows. Section 2 presents basic terminology and definitions, introduces the concept of computational conflicts and provides statements of the problems addressed in this paper. Section 3 discusses a simple case to illustrate different aspects of, and provide insight into, the conflict-free mapping problem. Section 4 discusses the conflict-free mapping problem in general. Section 5 presents an optimization procedure and integer programming problem formulations which find the time-optimal mapping without any computational conflicts. Section 6 concludes this paper and points out some future work.

2. TERMINOLOGY AND DEFINITIONS

Throughout this paper, sets, matrices and row vectors are denoted by capital letters, column vectors are represented by lower case symbols with an overbar, and scalars correspond to lower case letters. The transpose of a vector v̄ is denoted v̄^T. The vector 0 denotes the row or column vector whose entries are all zeroes. The dimensions of vector 0 and whether it denotes a row or column vector are implied

REFERENCES

[1] Marina C. Chen et al., "A Design Methodology for Synthesizing Parallel Algorithms and Architectures," J. Parallel and Distributed Computing, 1986.
[2] Rami G. Melhem et al., "Synthesizing Non-Uniform Systolic Designs," Proc. Int. Conf. on Parallel Processing (ICPP), 1986.
[3] Richard M. Karp et al., "The Organization of Computations for Uniform Recurrence Equations," J. ACM, 1967.
[4] S. Y. Kung, "VLSI Array Processors," IEEE ASSP Magazine, 1985.
[5] J. A. B. Fortes et al., "Bit Level Processor Arrays: Current Architectures and a Design and Programming Tool," Proc. IEEE Int. Symposium on Circuits and Systems, 1988.
[6] Thomas Kailath et al., "Regular Iterative Algorithms and Their Implementation on Processor Arrays," Proceedings of the IEEE, 1988.
[7] Peizong Lee et al., "Synthesizing Linear Array Algorithms from Nested For Loop Algorithms," IEEE Trans. Computers, 1988.
[8] D. I. Moldovan et al., "On the Design of Algorithms for VLSI Systolic Arrays," Proceedings of the IEEE, 1983.
[9] Patrice Quinton, "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations," Proc. Int. Symposium on Computer Architecture (ISCA '84), 1984.
[10] Dan I. Moldovan et al., "Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays," IEEE Trans. Computers, 1986.
[11] Kenneth E. Batcher, "Bit-Serial Parallel Processing Systems," IEEE Trans. Computers, 1982.
[12] Peter R. Cappello et al., "Unifying VLSI Array Designs with Geometric Transformations," Proc. Int. Conf. on Parallel Processing, 1983.
[13] Weijia Shang, "Scheduling, Partitioning and Mapping of Uniform Dependence Algorithms on Processor Arrays," 1990.