Fast Data Delivery for Many-Core Processors

Server workloads operate on large volumes of data. As a result, processors executing these workloads encounter frequent L1-D misses. In a many-core processor, an L1-D miss causes a request packet to be sent to an LLC slice and a response packet to be sent back to the L1-D, which results in significant latency overhead. While prior work targeted response packets, this work focuses on accelerating request packets. Unlike aggressive out-of-order (OoO) cores, the simpler cores used in many-core processors cannot hide the latency of L1-D request packets. We observe that the sequence of LLC slices that serve L1-D misses exhibits strong temporal correlation. Taking advantage of this observation, we design a simple and accurate predictor. Upon an L1-D miss, the predictor identifies the LLC slice that will serve the next L1-D miss, and a circuit is set up toward that slice to accelerate the transmission of the upcoming miss request. When the next miss occurs, the resulting request uses the already-established circuit to reach the LLC slice. We show that our proposal outperforms data prefetching mechanisms in a many-core processor because it (1) offers higher prediction accuracy and (2) does not waste valuable off-chip bandwidth, while requiring significantly less overhead. Using full-system simulation, we show that our proposal accelerates the serving of data misses by 22 percent and improves performance by 10 percent over the state-of-the-art network-on-chip.
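
To make the predictor concrete, the sketch below shows one plausible realization as a last-successor (Markov-style) table: the entry indexed by the slice that served the current L1-D miss records the slice that served the miss after it, and on each miss that entry supplies the prediction used to pre-establish a circuit toward the expected destination. The table size, the indexing by serving slice, and all names in the code are illustrative assumptions, not details taken from the paper.

/* Hypothetical sketch of a next-LLC-slice predictor based on temporal
 * correlation of the miss stream. The table, indexed by the slice that
 * served the current L1-D miss, remembers the slice that served the
 * following miss. Circuit set-up itself is outside this sketch. */

#include <stdint.h>
#include <stdio.h>

#define NUM_SLICES 64          /* assumed number of LLC slices (one per tile) */

typedef struct {
    uint8_t next_slice[NUM_SLICES]; /* last observed successor per slice */
    int     last_slice;             /* slice that served the previous miss */
} slice_predictor_t;

static void predictor_init(slice_predictor_t *p) {
    for (int s = 0; s < NUM_SLICES; s++) p->next_slice[s] = 0;
    p->last_slice = -1;
}

/* Called on every L1-D miss with the slice that serves it.
 * Returns the slice predicted to serve the *next* miss, so the NoC can
 * set up a circuit toward it ahead of time. */
static int predictor_observe(slice_predictor_t *p, int serving_slice) {
    if (p->last_slice >= 0)
        p->next_slice[p->last_slice] = (uint8_t)serving_slice; /* learn successor */
    p->last_slice = serving_slice;
    return p->next_slice[serving_slice];    /* prediction for the upcoming miss */
}

int main(void) {
    slice_predictor_t pred;
    predictor_init(&pred);

    /* Toy miss stream: serving slices repeat in a temporally correlated pattern. */
    int misses[] = {3, 7, 12, 3, 7, 12, 3, 7};
    for (unsigned i = 0; i < sizeof(misses) / sizeof(misses[0]); i++) {
        int predicted = predictor_observe(&pred, misses[i]);
        printf("miss served by slice %2d -> predict next miss at slice %2d\n",
               misses[i], predicted);
    }
    return 0;
}

In hardware, this would correspond to a small table consulted on every miss. Presumably, a misprediction only forfeits the benefit of the pre-established circuit and the request still traverses the network normally; the abstract does not specify this fallback behavior, so it is an assumption here.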
