Efficient compilation for queue size constrained queue processors

Queue computers use a FIFO data structure for data processing. The essential characteristics of a queue-based architecture are well suited to the demands of embedded systems: a compact instruction set, simple hardware logic, high parallelism, and low power consumption. The size of the queue is an important concern in the design of a realizable embedded queue processor. We introduce the relationship between parallelism, the length of data dependency edges in data flow graphs, and the queue utilization requirements. This paper presents a technique that makes the compiler aware of the size of the queue register file so that it can optimize programs to use the available hardware effectively. The compiler examines the data flow graph of a program and partitions it into clusters whenever it exceeds the queue limits of the target architecture. The presented algorithm deals with the two factors that affect the utilization of the queue, namely parallelism and the length of variables' reaching definitions. We analyze how the quality of the generated code is affected for the SPEC CINT95 benchmark programs under different queue size configurations. Our results show that, for reasonable queue sizes, the compiler generates code that is comparable to the code generated for infinite resources in terms of instruction count, static execution time, and instruction level parallelism.
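To make the notion of queue utilization concrete, the following is a minimal sketch of a toy queue machine, not the compiler algorithm presented in this paper. The instruction names (`ld`, `add`, `sub`, `mul`) and the function `run_queue_program` are hypothetical and chosen only for illustration. Operands are dequeued from the head of the queue and results are enqueued at the tail, and the sketch records the peak number of live queue entries for the expression (a + b) * (c - d).

```python
from collections import deque

# Hypothetical binary operations for the toy machine (illustrative only).
OPS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
}

def run_queue_program(program):
    """Run a toy queue program; return (result, peak queue occupancy)."""
    queue = deque()
    peak = 0
    for instr in program:
        if instr[0] == "ld":
            # Loads enqueue a value at the tail of the queue.
            queue.append(instr[1])
        else:
            # Binary operations dequeue two operands from the head
            # and enqueue the result at the tail.
            lhs = queue.popleft()
            rhs = queue.popleft()
            queue.append(OPS[instr[0]](lhs, rhs))
        peak = max(peak, len(queue))
    return queue.popleft(), peak

# (a + b) * (c - d) with a=1, b=2, c=7, d=3, emitted level by level
# (deepest level of the expression tree first).
program = [
    ("ld", 1), ("ld", 2), ("ld", 7), ("ld", 3),
    ("add",), ("sub",),
    ("mul",),
]
result, peak = run_queue_program(program)
```

In this sketch the four loads are mutually independent, so all four operands are live in the queue before the first addition is issued. Wider levels of the data flow graph therefore require more queue entries, which is the kind of pressure a queue-size-aware compiler has to bound.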
