Towards efficient fine-grain software pipelining

Dataflow software pipelining was proposed as a means of structuring fine-grain parallelism and has been studied mostly under an idealized dataflow architecture model with infinite resources [9]. In this paper, we investigate the effects of software pipelining under realistic architecture models with finite resources. Our target architecture is the McGill Dataflow Architecture, which employs conventional pipelining techniques to achieve fast instruction execution while exploiting fine-grain parallelism via a data-driven instruction scheduler. To achieve optimal execution efficiency, the compiled code must make balanced use of both the parallelism in the instruction execution unit and the fine-grain synchronization power of the machine. A detailed analysis based on simulation results is presented, focusing on two key architectural factors: the fine-grain synchronization capacity and the scheduling mechanism for enabling instructions. On the one hand, our results provide experimental evidence that software pipelining is an effective method for exploiting fine-grain parallelism in loops. On the other hand, the experiments have also revealed the (somewhat pessimistic) fact that even fully software-pipelined code may not achieve good performance if the overhead for fine-grain synchronization exceeds the capacity of the machine.
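
As a rough illustration only (not taken from the paper, which targets the McGill Dataflow Architecture rather than a conventional C compiler), the sketch below shows the general loop transformation that software pipelining performs: the loop body is split into stages, and in the steady state stage 2 of iteration i overlaps stage 1 of iteration i+1. The two-stage decomposition and all names are illustrative assumptions.

    #include <stdio.h>

    #define N 8

    /* Original loop: each iteration runs both stages back to back. */
    static void plain_loop(const int x[], int y[]) {
        for (int i = 0; i < N; i++) {
            int t = x[i] * 3;   /* stage 1: multiply           */
            y[i]  = t + 1;      /* stage 2: add, depends on t  */
        }
    }

    /* Software-pipelined version: prologue fills the pipeline, the
     * steady-state kernel overlaps stage 2 of iteration i with
     * stage 1 of iteration i+1, and the epilogue drains the pipe. */
    static void pipelined_loop(const int x[], int y[]) {
        int t = x[0] * 3;                  /* prologue                  */
        for (int i = 0; i < N - 1; i++) {
            int t_next = x[i + 1] * 3;     /* stage 1 of iteration i+1  */
            y[i] = t + 1;                  /* stage 2 of iteration i    */
            t = t_next;
        }
        y[N - 1] = t + 1;                  /* epilogue                  */
    }

    int main(void) {
        int x[N] = {0, 1, 2, 3, 4, 5, 6, 7};
        int y1[N], y2[N];
        plain_loop(x, y1);
        pipelined_loop(x, y2);
        for (int i = 0; i < N; i++)
            printf("%d %d\n", y1[i], y2[i]);  /* both columns should match */
        return 0;
    }

On a dataflow machine, an analogous overlap is obtained by the data-driven scheduler once the pipelined code is balanced; this is the balanced use of the execution pipeline and of fine-grain synchronization referred to above.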

[1] Ian Watson, et al. The Manchester prototype dataflow computer, 1985, CACM.

[2] Guang R. Gao, et al. An efficient pipelined dataflow processor architecture, 1988, Proceedings of SUPERCOMPUTING '88.

[3] John W. Backus, et al. Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs, 1978, CACM.

[4] Guang R. Gao, et al. Dataflow software pipelining: a case study, 1990, Ninth Annual International Phoenix Conference on Computers and Communications, Conference Proceedings.

[5] John Cocke, et al. The search for performance in scientific processors: the Turing Award lecture, 1988, CACM.

[6] Guang R. Gao, et al. Design of an Efficient Dataflow Architecture without Data Flow, 1988, Fifth Generation Computer Systems.

[7] David E. Culler, et al. Dataflow architectures, 1986.

[8] David E. Culler, et al. Resource requirements of dataflow programs, 1988, The 15th Annual International Symposium on Computer Architecture, Conference Proceedings.

[9] Guang Rong Gao. A pipelined code mapping scheme for static data flow computers, 1986.

[10] Roy F. Touzeau. A Fortran compiler for the FPS-164 scientific computer, 1984, SIGPLAN '84.

[11] Guang R. Gao. Algorithmic Aspects of Balancing Techniques for Pipelined Data Flow Code Generation, 1989, J. Parallel Distributed Comput.

[12] B. Ramakrishna Rau, et al. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing, 1981, MICRO 14.

[13] Philip Wadler. A new array operation, 1986, Graph Reduction.

[14] Paul Hudak, et al. Arrays, non-determinism, side-effects, and parallelism: A functional perspective, 1987, Graph Reduction.

[15] Guang R. Gao, et al. Parallel function invocation in a dynamic argument-fetching dataflow architecture, 1990, Proceedings of PARBASE-90: International Conference on Databases, Parallel Architectures, and Their Applications.

[16] Guang R. Gao, et al. A Maximally Pipelined Tridiagonal Linear Equation Solver, 1986, J. Parallel Distributed Comput.

[17] John Sargeant, et al. Control of parallelism in the Manchester Dataflow Machine, 1987, FPCA.