Enhancing Server Efficiency in the Face of Killer Microseconds
Thomas F. Wenisch | Amirhossein Mirhosseini | Akshitha Sriraman
[1] Yan Solihin,et al. QoS policies and architecture for cache/memory in CMP platforms , 2007, SIGMETRICS '07.
[2] Norman P. Jouppi,et al. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction , 2003, MICRO.
[3] Kevin Skadron,et al. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[4] Thomas F. Wenisch,et al. Memory persistency , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[5] David G. Andersen,et al. Design Guidelines for High Performance RDMA Systems , 2016, USENIX ATC.
[6] Trevor N. Mudge,et al. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008, 2008 International Symposium on Computer Architecture.
[7] David A. Wood,et al. WiDGET: Wisconsin decoupled grid execution tiles , 2010, ISCA.
[8] O. Mutlu,et al. Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.
[9] Thomas F. Wenisch,et al. µTune: Auto-Tuned Threading for OLDI Microservices , 2018, OSDI.
[10] Donald Yeung,et al. Transparent threads: resource sharing in SMT processors for high single-thread performance , 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.
[11] Pradip Bose,et al. SMT-centric power-aware thread placement in chip multiprocessors , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[12] Brad Calder,et al. Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[13] Mattan Erez,et al. Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems , 2016, ASPLOS.
[14] Margaret Martonosi,et al. Reducing GPU offload latency via fine-grained CPU-GPU synchronization , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[15] Xiaosong Ma,et al. KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[16] Francisco J. Cazorla,et al. Predictable performance in SMT processors: synergy between the OS and SMTs , 2006, IEEE Transactions on Computers.
[17] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[18] Hamid Sarbazi-Azad,et al. Domino Temporal Data Prefetcher , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[19] Michael Ferdman,et al. Taming the Killer Microsecond , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[20] Hari Angepat,et al. A cloud-scale acceleration architecture , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[21] Munira Hussain,et al. Need for Speed: Comparing FDR and EDR InfiniBand (Part 2) , 2018 .
[22] Christina Delimitrou,et al. The Architectural Implications of Cloud Microservices , 2018, IEEE Computer Architecture Letters.
[23] Babak Falsafi,et al. Scale-out NUMA , 2014, ASPLOS.
[24] Christoforos E. Kozyrakis,et al. From chaos to QoS: case studies in CMP resource management , 2007, CARN.
[25] David G. Lowe,et al. Scalable Nearest Neighbor Algorithms for High Dimensional Data , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[26] Asit K. Mishra,et al. METE: meeting end-to-end QoS in multicores through system-wide resource management , 2011, PERV.
[27] Brahim Medjahed,et al. A Query Rewriting Approach for Web Service Composition , 2010, IEEE Transactions on Services Computing.
[28] Hamid Sarbazi-Azad,et al. BiNoCHS: Bimodal network-on-chip for CPU-GPU heterogeneous systems , 2017, 2017 Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS).
[29] Phil Rogers,et al. Heterogeneous system architecture overview , 2013, 2013 IEEE Hot Chips 25 Symposium (HCS).
[30] Dan Tsafrir,et al. The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops) , 2007, ExpCS '07.
[31] Brad Fitzpatrick,et al. Distributed caching with memcached , 2004 .
[32] Sai Prashanth Muralidhara,et al. Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[33] Gu-Yeon Wei,et al. Quantifying sources of error in McPAT and potential impacts on architectural studies , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[34] Satish Narayanasamy,et al. Language-level persistency , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[35] James E. Smith,et al. Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[36] Daniel Sánchez,et al. Tailbench: a benchmark suite and evaluation methodology for latency-critical applications , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[37] Christina Delimitrou,et al. Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.
[38] David Wentzlaff,et al. MITTS: Memory Inter-arrival Time Traffic Shaping , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[39] Thomas F. Wenisch,et al. Deconstructing the Tail at Scale Effect Across Network Protocols , 2017, ArXiv.
[40] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[41] Jeffrey H. Meyerson,et al. The Go Programming Language , 2014, IEEE Softw..
[42] Kang G. Shin,et al. Efficient Memory Disaggregation with Infiniswap , 2017, NSDI.
[43] Jin-Soo Kim,et al. NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs , 2016, HotStorage.
[44] Satish Narayanasamy,et al. Persistency for synchronization-free regions , 2018, PLDI.
[45] Norman P. Jouppi,et al. Conjoined-Core Chip Multiprocessing , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[46] David A. Patterson,et al. Attack of the killer microseconds , 2017, Commun. ACM.
[47] David G. Andersen,et al. Using RDMA efficiently for key-value services , 2015, SIGCOMM 2015.
[48] Tom White,et al. Hadoop: The Definitive Guide , 2009 .
[49] Nam Sung Kim,et al. GPU register file virtualization , 2015, MICRO.
[50] Hamid Sarbazi-Azad,et al. LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching , 2018, ASPLOS.
[51] Gabriel H. Loh,et al. PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.
[52] Josep Torrellas,et al. Survive: Pointer-Based In-DRAM Incremental Checkpointing for Low-Cost Data Persistence and Rollback-Recovery , 2017, IEEE Computer Architecture Letters.
[53] Chen Ding,et al. Quantifying the cost of context switch , 2007, ExpCS '07.
[54] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[55] Gérard Boudol,et al. Fair Cooperative Multithreading , 2007, CONCUR.
[56] Christina Delimitrou,et al. Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.
[57] Ming Zhao,et al. Client-side Flash Caching for Cloud Systems , 2014, SYSTOR 2014.
[58] Thomas F. Wenisch,et al. Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[59] Zeshan Chishti,et al. Optimal Power/Performance Pipeline Depth for SMT in Scaled Technologies , 2008, IEEE Transactions on Computers.
[60] Junjie Wu,et al. BigHouse: A simulation infrastructure for data center systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.
[61] Vivek Sarkar,et al. RegMutex: Inter-Warp GPU Register Time-Sharing , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[62] Christopher Frost,et al. Better I/O through byte-addressable, persistent memory , 2009, SOSP '09.
[63] Daniel Mossé,et al. Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[64] Tony Tung,et al. Scaling Memcache at Facebook , 2013, NSDI.
[65] Thomas F. Wenisch,et al. Temporal streaming of shared memory , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[66] Daniel Sánchez,et al. Ubik: efficient cache sharing with strict qos for latency-critical workloads , 2014, ASPLOS.
[67] Thomas F. Wenisch,et al. Thermostat: Application-transparent Page Management for Two-tiered Main Memory , 2017, ASPLOS.
[68] Engin Ipek,et al. Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.
[69] Thomas F. Wenisch,et al. Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[70] Ronald N. Kalla,et al. IBM Power9 Processor Architecture , 2017, IEEE Micro.
[71] André Seznec,et al. Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[72] Christoforos E. Kozyrakis,et al. Towards energy proportionality for large-scale latency-critical workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[73] Martin F. Porter,et al. An algorithm for suffix stripping , 1997, Program.
[74] Yale N. Patt,et al. MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[75] Martin Porter,et al. Snowball: A language for stemming algorithms , 2001 .
[76] Shiliang Hu,et al. LASER: Light, Accurate Sharing dEtection and Repair , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[77] Sriram Sankar,et al. Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.
[78] Thomas F. Wenisch,et al. Disaggregated memory for expansion and sharing in blade servers , 2009, ISCA '09.
[79] Francisco J. Cazorla,et al. QoS for high-performance SMT processors in embedded systems , 2004, IEEE Micro.
[80] Thomas F. Wenisch,et al. PowerNap: eliminating server idle power , 2009, ASPLOS.
[81] Dean M. Tullsen,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.
[82] R. Govindarajan,et al. Probabilistic Shared Cache Management (PriSM) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[83] Gu-Yeon Wei,et al. Profiling a warehouse-scale computer , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[84] Beng Chin Ooi,et al. A Performance Study of Big Data on Small Nodes , 2015, Proc. VLDB Endow..
[85] Thomas F. Wenisch,et al. μSuite: A Benchmark Suite for Microservices , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).
[86] Lingjia Tang,et al. SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[87] Christina Delimitrou,et al. Workload characterization of interactive cloud services on big and small server platforms , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).
[88] Scott A. Mahlke,et al. Composite Cores: Pushing Heterogeneity Into a Core , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[89] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.
[90] Trevor Mudge,et al. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008 .
[91] Mahmut T. Kandemir,et al. SHARP control: Controlled shared cache management in chip multiprocessors , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[92] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..
[93] David Wentzlaff,et al. The sharing architecture: sub-core configurability for IaaS clouds , 2014, ASPLOS.
[94] Steven K. Reinhardt,et al. The impact of resource partitioning on SMT processors , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[95] Sanjay Kumar,et al. System software for persistent memory , 2014, EuroSys '14.
[96] Margo I. Seltzer,et al. Flash Caching on the Storage Client , 2013, USENIX Annual Technical Conference.
[97] Mor Harchol-Balter,et al. Performance Modeling and Design of Computer Systems: Queueing Theory in Action , 2013 .
[98] Miguel Castro,et al. FaRM: Fast Remote Memory , 2014, NSDI.
[99] Luiz André Barroso,et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.
[100] Xi Yang,et al. Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading , 2016, USENIX Annual Technical Conference.
[101] Ravi Iyer,et al. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[102] Dan Williams,et al. Platform Storage Performance With 3D XPoint Technology , 2017, Proceedings of the IEEE.
[103] S. Winkel. Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[104] Christoforos E. Kozyrakis,et al. IX: A Protected Dataplane Operating System for High Throughput and Low Latency , 2014, OSDI.
[105] Hamid Sarbazi-Azad,et al. Bingo Spatial Data Prefetcher , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[106] Eunyoung Jeong,et al. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.
[107] Christoforos E. Kozyrakis,et al. Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[108] Hosung Park,et al. What is Twitter, a social network or a news media? , 2010, WWW '10.
[109] Norman P. Jouppi,et al. Heterogeneous chip multiprocessors , 2005, Computer.
[110] Rasmus Pagh,et al. Cuckoo Hashing , 2001, Encyclopedia of Algorithms.
[111] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[112] Timothy Roscoe,et al. Decoupling Cores, Kernels, and Operating Systems , 2014, OSDI.
[113] Jack L. Lo,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[114] Lingjia Tang,et al. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.
[115] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[116] Babak Falsafi,et al. Scale-out processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[117] Raju Rangaswami,et al. Centaur: Host-Side SSD Caching for Storage Performance Control , 2015, 2015 IEEE International Conference on Autonomic Computing.
[118] James Reinders,et al. Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .
[119] Ankit Singla,et al. Enabling Efficient RDMA-based Synchronous Mirroring of Persistent Memory Transactions , 2018, ArXiv.
[120] Ronald G. Dreslinski,et al. Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[121] Thomas F. Wenisch,et al. Spatio-temporal memory streaming , 2009, ISCA '09.
[122] Urs Hölzle,et al. Brawny cores still beat wimpy cores, most of the time , 2010 .
[123] Thu D. Nguyen,et al. Exploiting Heterogeneity for Tail Latency and Energy Efficiency , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[124] Thomas F. Wenisch,et al. Power management of online data-intensive services , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[125] Christina Delimitrou,et al. Amdahl's law for tail latency , 2018, Commun. ACM.
[126] Edouard Bugnion,et al. ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks , 2017, SOSP.
[127] Laxmi N. Bhuyan,et al. μDPM: Dynamic Power Management for the Microsecond Era , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[128] Christoforos E. Kozyrakis,et al. ReFlex: Remote Flash ≈ Local Flash , 2017, ASPLOS.
[129] Christoforos E. Kozyrakis,et al. Vantage: Scalable and efficient fine-grain cache partitioning , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[130] Daniel Sánchez,et al. Rubik: Fast analytical power management for latency-critical systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[131] Jinyang Li,et al. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store , 2013, USENIX ATC.
[132] Carsten Binnig,et al. The End of Slow Networks: It's Time for a Redesign , 2015, Proc. VLDB Endow..
[133] Stijn Eyerman,et al. Probabilistic job symbiosis modeling for SMT processor scheduling , 2010, ASPLOS XV.
[134] Marcos K. Aguilera,et al. Remote memory in the age of fast networks , 2017, SoCC.
[135] Babak Falsafi,et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.
[136] Nael B. Abu-Ghazaleh,et al. CORF: Coalescing Operand Register File for GPUs , 2019, ASPLOS.
[137] Kushagra Vaid,et al. Web search using mobile cores: quantifying and mitigating the price of efficiency , 2010, ISCA.
[138] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[139] Steve Byan,et al. Mercury: Host-side flash caching for the data center , 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).
[140] Aart J. C. Bik,et al. Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.
[141] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009 .
[142] Onur Mutlu,et al. MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices , 2018, FAST.
[143] Bin Fan,et al. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing , 2013, NSDI.