Efficient Cache Organization For Application Specific And General Purpose Processors

The performance gap between processor and memory remains a major bottleneck in both application-specific and general-purpose processors. This thesis aims to ease this bottleneck by exploiting the characteristics of the application domain to improve the cache organization for two distinct processor architectures: (1) application-specific processors for packet forwarding and (2) general-purpose processors. Packet forwarding algorithms use a trie data structure to determine the forwarding route. We observe that the locality characteristics of nodes differ across the levels of such a trie. Nodes closer to the root, especially the immediate children of the root node (level-one nodes), exhibit higher temporal locality than nodes lower down the trie. Based on this observation, we propose a novel Heterogeneously Segmented Cache Architecture (HSCA) that uses separate caches for level-one and lower-level nodes, each with a carefully chosen size. We also propose a new replacement policy to enhance the performance of HSCA. Performance evaluation indicates that HSCA reduces average memory access time by up to 32% relative to a unified cache that shares the same cache space among all levels of the trie. HSCA also outperforms a previously proposed results cache. Using a large root branching factor in a forwarding trie necessarily introduces a large number of nodes at level one. Among these, only the nodes that cover prefixes from the routing table are useful, while the rest are superfluous. We find that as many as 75% of the level-one nodes are superfluous. This leads to a skewed distribution of useful nodes among the cache sets of the level-one node cache. We propose a novel two-level mapping framework that achieves a better node-to-cache-set mapping and
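
The segmented organization described in the abstract can be illustrated with a minimal sketch. It only shows the idea of routing trie-node accesses to separate caches by level. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the class names `LRUCache` and `SegmentedNodeCache`, the fully associative organization, the plain LRU replacement (the thesis proposes its own policy, which is not specified in the abstract), the capacities, and the access trace.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by node id (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        """Record one access; return True on a hit, False on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return False


class SegmentedNodeCache:
    """Routes each trie-node access to a cache chosen by the node's level."""

    def __init__(self, level_one_capacity, lower_capacity):
        # Assumed split: one cache for level-one nodes, one for all deeper nodes.
        self.level_one = LRUCache(level_one_capacity)
        self.lower = LRUCache(lower_capacity)

    def access(self, level, node_id):
        cache = self.level_one if level == 1 else self.lower
        return cache.access(node_id)


# Hypothetical trace of (trie level, node id) pairs: every lookup touches the
# same level-one node and then different deeper nodes.
trace = [(1, "A"), (2, "A0"), (3, "A00"),
         (1, "A"), (2, "A1"), (3, "A10"),
         (1, "A"), (2, "A0"), (3, "A01")]

cache = SegmentedNodeCache(level_one_capacity=1, lower_capacity=2)
for level, node_id in trace:
    cache.access(level, node_id)

print("level-one hits/misses:", cache.level_one.hits, cache.level_one.misses)
print("lower-level hits/misses:", cache.lower.hits, cache.lower.misses)
```

In this toy trace the frequently reused level-one node stays resident in its own segment even though the deeper nodes keep evicting one another, which is the kind of interference a unified cache of the same total size would be exposed to.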
