Instruction Prefetching of Systems Codes with Layout Optimized for Reduced Cache Misses

High-performing on-chip instruction caches are crucial to keep fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large systems codes. To improve the performance of the latter codes, the compiler can be used to lay out the code in memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a state that can be exploited by a new type of instruction prefetching: guarded sequential prefetching. The idea is that the compiler leaves hints in the code as to how the code was laid out. Then, at run time, the prefetching hardware detects these hints and uses them to prefetch more effectively. This scheme can be implemented very cheaply: it needs one bit encoded in control transfer instructions and a prefetch module that requires only minor extensions to existing next-line sequential prefetchers. Furthermore, the scheme can be turned off and on at run time by toggling a bit in the TLB. The scheme is evaluated with simulations using complete traces from a 4-processor machine. Overall, for 16-Kbyte primary instruction caches, guarded sequential prefetching removes, on average, 66% of the instruction misses remaining in an operating system with an optimized layout, speeding up the operating system by 10%. Moreover, the scheme is more cost-effective and robust than existing sequential prefetching techniques.
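
The abstract does not spell out what the one-bit hint means, so the following is only a minimal sketch under stated assumptions, not the paper's actual design. It assumes that a set guard bit on a control transfer instruction marks the end of a compiler-laid-out sequence, that the prefetcher keeps requesting next lines until it scans a line containing such an instruction, and that a per-page enable flag stands in for the TLB bit. The names `Instr`, `lines_to_prefetch`, `LINE_WORDS`, `page_enabled`, and `max_lines` are all hypothetical.

```python
from dataclasses import dataclass

LINE_WORDS = 8  # instructions per cache line (assumed value, not from the paper)


@dataclass
class Instr:
    is_control_transfer: bool = False
    guard: bool = False  # assumed hint: the laid-out sequence ends at this instruction


def lines_to_prefetch(code, miss_line, page_enabled=True, max_lines=4):
    """Return the cache-line indices requested after a miss on miss_line.

    Plain next-line sequential prefetching would request the following
    lines unconditionally. This sketch stops as soon as the line just
    scanned contains a guarded control transfer instruction.
    """
    if not page_enabled:  # stands in for the per-page enable bit in the TLB
        return []
    total_lines = (len(code) + LINE_WORDS - 1) // LINE_WORDS
    requested = []
    line = miss_line
    while len(requested) < max_lines:
        start = line * LINE_WORDS
        chunk = code[start:start + LINE_WORDS]
        # A guarded control transfer ends the sequential run (assumption).
        if any(i.is_control_transfer and i.guard for i in chunk):
            break
        line += 1
        if line >= total_lines:
            break
        requested.append(line)
    return requested


# Four lines of straight-line code with no hints.
plain = [Instr() for _ in range(4 * LINE_WORDS)]

# Same code, but instruction 13 (in line 1) is a guarded branch.
guarded = [Instr() for _ in range(4 * LINE_WORDS)]
guarded[13] = Instr(is_control_transfer=True, guard=True)

print(lines_to_prefetch(plain, 0))                        # [1, 2, 3]
print(lines_to_prefetch(guarded, 0))                      # [1]
print(lines_to_prefetch(guarded, 0, page_enabled=False))  # []
```

In this toy model the guard bit keeps the prefetcher from running past the end of a sequence, which is the kind of useless traffic an unguarded next-line scheme would generate. How the real hardware detects and acts on the hint may differ.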
