ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers

Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, because they require significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and to prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers.
We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8% and system throughput by 8.4% compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
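
The periodic ordering described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Controller`, `rank_threads`, and `pick_next`, the use of cycle counts as the unit of service, and the tie-breaking rule are all assumptions made for illustration. Each hypothetical controller only updates local counters during a period; the controllers exchange information once, at the end of the period, to form a single thread ordering.

```python
from collections import defaultdict


class Controller:
    """Hypothetical memory controller that tracks service per thread locally."""

    def __init__(self):
        self.local_service = defaultdict(int)

    def record(self, thread_id, cycles):
        # Called while serving requests; no cross-controller communication.
        self.local_service[thread_id] += cycles

    def drain(self):
        # Report this period's counters and reset them for the next period.
        report, self.local_service = dict(self.local_service), defaultdict(int)
        return report


def rank_threads(controllers, total_service):
    """End-of-period step: merge reports, then order threads by attained service."""
    for ctrl in controllers:
        for thread_id, cycles in ctrl.drain().items():
            total_service[thread_id] = total_service.get(thread_id, 0) + cycles
    # Least attained service first; ties broken by thread id (an assumption).
    return sorted(total_service, key=lambda t: (total_service[t], t))


def pick_next(pending_thread_ids, ranking):
    """Choose which pending thread a controller serves next under the ranking."""
    position = {t: i for i, t in enumerate(ranking)}
    # Threads with no recorded service yet are treated as highest priority
    # (an assumption of this sketch).
    return min(pending_thread_ids, key=lambda t: position.get(t, -1))


# One period with two controllers and three threads.
controllers = [Controller(), Controller()]
controllers[0].record("A", 300)
controllers[1].record("A", 200)
controllers[0].record("B", 50)
controllers[1].record("C", 120)

totals = {}
ranking = rank_threads(controllers, totals)
print(ranking)                         # ['B', 'C', 'A']
print(pick_next(["A", "C"], ranking))  # C
```

In this sketch the only coordination is the call to `rank_threads` at the period boundary, which mirrors the abstract's point that long periods make cross-controller communication infrequent.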
