Allowing for ILP in an embedded Java processor

Java processors are well suited to embedded and network computing applications such as Internet TVs, set-top boxes, smart phones, and other consumer electronics. In this paper we investigate cost-effective microarchitectural techniques to exploit parallelism in Java bytecode streams. First, we propose the use of a fill unit that stores decoded bytecodes in a decoded bytecode cache. This mechanism improves the fetch and decode bandwidth of Java processors by 2 to 3 times. These additional hardware units can also be used to perform optimizations such as instruction folding. This is particularly significant because experiments with the Verilog model of Sun Microsystems' picoJava-II core demonstrate that instruction folding lies in the critical path. Moving the folding logic from the critical path of the processor to the fill unit improves the clock frequency by 25%. Out-of-order ILP exploitation is not investigated because of its prohibitive cost, but in-order dual issue with a 64-entry decoded bytecode cache yields a 10% to 14% improvement in execution cycles. Another contribution of the paper is a stack disambiguation technique that eliminates false dependencies between different types of stack accesses. Stack disambiguation exposes further parallelism: a dual in-order issue microengine with a 64-entry bytecode cache yields an additional 10% reduction in cycles, leading to an aggregate reduction of 17% to 24% in execution cycles.
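To make the instruction folding mentioned above concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes a simple four-bytecode pattern (`iload a; iload b; <ALU op>; istore c`) that is collapsed into one register-style operation, as a fill unit might do before writing into the decoded bytecode cache. The tuple encoding of bytecodes, the `FoldedOp` type, and the `fold` function are hypothetical names introduced only for illustration; the folding patterns actually supported by picoJava-II or by the proposed fill unit may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# Hypothetical decoded form: (mnemonic, operand). Mnemonics follow the JVM
# instruction names; the tuple encoding itself is an assumption.
Bytecode = Tuple[str, Optional[int]]

# ALU bytecodes considered foldable in this sketch (an assumed subset).
ALU_OPS = {"iadd", "isub", "imul"}


@dataclass(frozen=True)
class FoldedOp:
    """One register-style operation: locals[dst] = locals[src1] <op> locals[src2]."""
    op: str
    dst: int
    src1: int
    src2: int


def fold(stream: List[Bytecode]) -> List[Union[Bytecode, FoldedOp]]:
    """Collapse `iload a; iload b; <alu>; istore c` groups into one FoldedOp.

    Bytecodes that do not start a matching group pass through unchanged.
    """
    out: List[Union[Bytecode, FoldedOp]] = []
    i = 0
    while i < len(stream):
        window = stream[i:i + 4]
        if (len(window) == 4
                and window[0][0] == "iload"
                and window[1][0] == "iload"
                and window[2][0] in ALU_OPS
                and window[3][0] == "istore"):
            # The two loads and the store never touch the operand stack
            # once folded; only local-variable indices remain.
            out.append(FoldedOp(op=window[2][0], dst=window[3][1],
                                src1=window[0][1], src2=window[1][1]))
            i += 4
        else:
            out.append(stream[i])
            i += 1
    return out


if __name__ == "__main__":
    # c = a + b, with a, b, c in local variables 1, 2, 3, then one extra bytecode.
    demo = [("iload", 1), ("iload", 2), ("iadd", None), ("istore", 3),
            ("iconst_0", None)]
    for entry in fold(demo):
        print(entry)
```

In this sketch the five-bytecode demo stream becomes two entries: one folded operation and the untouched `iconst_0`. Doing this matching when the cache is filled, rather than on every fetch, is the sense in which folding moves off the critical path.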
