End-to-End Considerations in Unification of High-Performance IO

The performance of modern distributed storage and computing frameworks depends heavily on the IO performance of the many storage and network devices involved. Fortunately, these IO devices have undergone a rapid transformation in the past decade and can now deliver multi-gigabit-per-second bandwidths and ultra-low IO latencies. In contrast, the performance of a single CPU has stalled over the same period. Hence, the traditional notion of a single fast CPU connected to multiple slow devices no longer holds. Yet IO stacks are still designed to optimize CPU time by executing multiple services and routines on a fast CPU while a slow IO operation is in progress. This situation has led to a CPU-IO performance gap: the CPU cannot keep up with the execution of thick software stacks and OS routines during a fast IO operation on high-performance network and storage devices, which limits the performance delivered to data-crunching applications. Multiple research efforts in industry and academia have been launched to improve this situation by reducing hardware and software overheads through better IO interfaces, efficient management of IO resources, and the use of manycore CPUs for IO processing. However, these efforts target either the network stack or the storage stack, but not the combination of both. In this thesis, we address this performance gap and advocate a holistic approach to managing resources, data flows, and devices (network or storage) to form end-to-end data flows in a distributed setting. We first quantify the software and OS overhead in IO operations and argue for reducing it by building upon the high-performance networking principle. The general philosophy of this principle is to recognize and separate the slow control setup from the fast data access path, and to involve the CPU/OS in the former only selectively, in managerial roles.
In the thesis framework, we extend the separation philosophy from networks to storage devices by identifying common themes in the evolution of their software stacks. After identifying common high-performance IO properties, we make a case for unifying the network and storage stacks. We then design and build a proof of concept, FlashNet, a unified software IO stack that uses high-performance networking abstractions and semantics to access remote storage. In accordance with the original separation philosophy, FlashNet allows the allocation and translation of both

[1]  David P. Anderson,et al.  The performance of message‐passing using restricted virtual memory remapping , 1991, Softw. Pract. Exp..

[2]  Chris Maeda,et al.  Networking performance for microkernels , 1992, [1992] Proceedings Third Workshop on Workstation Operating Systems.

[3]  Keir Fraser,et al.  Arsenic: a user-accessible gigabit Ethernet interface , 2001, Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213).

[4]  Vivek S. Pai,et al.  SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy , 2011, NSDI.

[5]  Nick McKeown,et al.  OpenFlow: enabling innovation in campus networks , 2008, CCRV.

[6]  Mahadev Satyanarayanan,et al.  The ITC distributed file system: principles and design , 1985, SOSP 1985.

[7]  William J. Bolosky,et al.  Mach: A New Kernel Foundation for UNIX Development , 1986, USENIX Summer.

[8]  Animesh Trivedi,et al.  jVerbs: ultra-low latency for data center applications , 2013, SoCC.

[9]  David R. Cheriton,et al.  Improving Server Application Performance via Pure TCP ACK Receive Optimization , 2013, USENIX Annual Technical Conference.

[10]  J. Howard Et El,et al.  Scale and performance in a distributed file system , 1988 .

[11]  J.M. Smith,et al.  Giving applications access to Gb/s networking , 1993, IEEE Network.

[12]  J. Larus,et al.  Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[13]  Kenneth C. Knowlton,et al.  A fast storage allocator , 1965, CACM.

[14]  David L. Black,et al.  IANA Registries for the Remote Direct Data Placement (RDDP) Protocols , 2012, RFC.

[15]  Brian N. Bershad,et al.  Extensibility safety and performance in the SPIN operating system , 1995, SOSP.

[16]  Luigi Rizzo,et al.  netmap: A Novel Framework for Fast Packet I/O , 2012, USENIX ATC.

[17]  Chris I. Dalton,et al.  User-space protocols deliver high performance to applications on a low-cost Gb/s LAN , 1994, SIGCOMM 1994.

[18]  Ronald B. Brightwell,et al.  Scalability limitations of VIA-based technologies in supporting MPI , 2000 .

[19]  Jeffrey C. Mogul,et al.  TCP Offload Is a Dumb Idea Whose Time Has Come , 2003, HotOS.

[20]  Dhabaleswar K. Panda,et al.  High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[21]  Dhabaleswar K. Panda,et al.  Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[22]  Michael M. Swift,et al.  Hathi: durable transactions for memory using flash , 2012, DaMoN '12.

[23]  Seth Copen Goldstein,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[24]  Rajesh K. Gupta,et al.  Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[25]  Brian N. Bershad,et al.  An I/O System for Mach 3.0 , 1991, USENIX MACH Symposium.

[26]  Thorsten von Eicken,et al.  Incorporating Memory Management into User-Level Network Interfaces , 1997 .

[27]  Animesh Trivedi,et al.  A case for RDMA in clouds: turning supercomputer networking into commodity , 2011, APSys.

[28]  L. Grossman Large Receive Offload implementation in Neterion 10GbE Ethernet driver , 2010 .

[29]  Dutch T. Meyer,et al.  Strata: scalable high-performance storage on virtualized non-volatile memory , 2014, FAST.

[30]  Mahadev Satyanarayanan,et al.  Lightweight Recoverable Virtual Memory , 1993, SOSP.

[31]  Roy H. Campbell,et al.  Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory , 2011, FAST.

[32]  Peter M. Chen,et al.  Free transactions with Rio Vista , 1997, SOSP.

[33]  Trevor N. Mudge,et al.  FlashCache: a NAND flash memory file cache for low power web servers , 2006, CASES '06.

[34]  Jacob Nelson,et al.  Latency-Tolerant Software Distributed Shared Memory , 2015, USENIX ATC.

[35]  Sanjay Kumar,et al.  System software for persistent memory , 2014, EuroSys '14.

[36]  Animesh Trivedi,et al.  Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft-RDMA to Boost Memcached , 2012, USENIX ATC.

[37]  Torsten Hoefler,et al.  DARE: High-Performance State Machine Replication on RDMA Networks , 2015, HPDC.

[38]  C. Dalton,et al.  Afterburner (network-independent card for protocols) , 1993, IEEE Network.

[39]  Laxmi N. Bhuyan,et al.  A new server I/O architecture for high speed networks , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[40]  Shimin Chen,et al.  FlashLogging: exploiting flash devices for synchronous logging performance , 2009, SIGMOD Conference.

[41]  Bruce S. Davie A host-network interface architecture for ATM , 1991, SIGCOMM '91.

[42]  Srihari Makineni,et al.  Architectural characterization of TCP/IP packet processing on the Pentium/spl reg/ M microprocessor , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[43]  Jialin Li,et al.  Towards High-Performance Application-Level Storage Management , 2014, HotStorage.

[44]  Scott Rixner,et al.  An efficient programmable 10 gigabit Ethernet network interface card , 2005, 11th International Symposium on High-Performance Computer Architecture.

[45]  Derek McAuley,et al.  Protocol and Interface for ATM LANs , 1994, J. High Speed Networks.

[46]  C. C. Feldmeier Multiplexing issues in communication system design , 1990, SIGCOMM 1990.

[47]  Rajesh Gupta,et al.  From ARIES to MARS: transaction support for next-generation, solid-state drives , 2013, SOSP.

[48]  Peter Druschel,et al.  Lazy receiver processing (LRP): a network subsystem architecture for server systems , 1996, OSDI '96.

[49]  Arun Jagatheesan,et al.  Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[50]  Dhabaleswar K. Panda,et al.  Efficient virtual interface architecture (VIA) support for the IBM SP switch-connected NT clusters , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[51]  Pradeep Dubey,et al.  Architecting to achieve a billion requests per second throughput on a single key-value store server platform , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[52]  Milon Mackey,et al.  An implementation of the Hamlyn sender-managed interface architecture , 1996, OSDI '96.

[53]  Miguel Castro,et al.  FaRM: Fast Remote Memory , 2014, NSDI.

[54]  Hemal Shah,et al.  Direct Data Placement over Reliable Transports , 2007, RFC.

[55]  Jonathan M. Smith,et al.  Hardware/Software Organization of a High-Performance ATM Host Interface , 1993, IEEE J. Sel. Areas Commun..

[56]  Brad Fitzpatrick,et al.  Distributed caching with memcached , 2004 .

[57]  Greg J. Regnier,et al.  TCP onloading for data center servers , 2004, Computer.

[58]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[59]  Michael M. Swift,et al.  FlashVM: Virtual Memory Management on Flash , 2010, USENIX Annual Technical Conference.

[60]  David Banks,et al.  A High-Performance Network Architecture for a PA-RISC Workstation , 1993, IEEE J. Sel. Areas Commun..

[61]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[62]  Michael M. Swift,et al.  Aerie: flexible file-system interfaces to storage-class memory , 2014, EuroSys '14.

[63]  Willy Zwaenepoel,et al.  Optimizing TCP Receive Performance , 2008, USENIX ATC.

[64]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[65]  Andrea C. Arpaci-Dusseau,et al.  Transforming policies into mechanisms with infokernel , 2003, SOSP '03.

[66]  David D. Clark,et al.  Architectural considerations for a new generation of protocols , 1990, SIGCOMM '90.

[67]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[68]  Greg J. Regnier,et al.  TCP performance re-visited , 2003, 2003 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS 2003..

[69]  Larry L. Peterson,et al.  RPC in the x-Kernel: evaluating new design techniques , 1989, SOSP '89.

[70]  Sandia Report,et al.  The Portals 4.0 Message Passing Interface , 2008 .

[71]  Willy Zwaenepoel,et al.  IO-Lite: a unified I/O buffering and caching system , 1999, TOCS.

[72]  Mark Silberstein,et al.  GPUnet , 2014, OSDI.

[73]  G. Chesson,et al.  Protocol engine design , 1988 .

[74]  Steven Swanson,et al.  QuickSAN: a storage area network for fast, distributed, solid state disks , 2013, ISCA.

[75]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[76]  Eric A. Brewer,et al.  Cluster-based scalable network services , 1997, SOSP.

[77]  Shekhar Y. Borkar,et al.  Supporting systolic and memory communication in iWarp , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[78]  Jae-Myung Kim,et al.  A case for flash memory ssd in enterprise database applications , 2008, SIGMOD Conference.

[79]  Ravishankar K. Iyer,et al.  Addressing TCP/IP processing challenges using the IA and IXP processors , 2003 .

[80]  David L Tennenhouse Layered Multiplexing Considered Harmful , 2008 .

[81]  Alexandros Labrinidis,et al.  Challenges and Opportunities with Big Data , 2012, Proc. VLDB Endow..

[82]  Christian F. Tschudin,et al.  Flexible protocol stacks , 1991, SIGCOMM '91.

[83]  Ronald G. Dreslinski,et al.  Performance analysis of system overheads in TCP/IP workloads , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[84]  Russel Sandberg,et al.  The Sun Network Filesystem: Design, Implementation and Experience , 2001 .

[85]  Andrea C. Arpaci-Dusseau,et al.  Deploying Safe User-Level Network Services with icTCP , 2004, OSDI.

[86]  Larry L. Peterson,et al.  Making paths explicit in the Scout operating system , 1996, OSDI '96.

[87]  Katerina J. Argyraki,et al.  RouteBricks: exploiting parallelism to scale software routers , 2009, SOSP '09.

[88]  Margo I. Seltzer,et al.  Structure and Performance of the Direct Access File System , 2002, USENIX ATC, General Track.

[89]  Thomas R. Gross,et al.  RStore: A Direct-Access DRAM-based Data Store , 2015, 2015 IEEE 35th International Conference on Distributed Computing Systems.

[90]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[91]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[92]  Parag Agrawal,et al.  The case for RAMClouds: scalable high-performance storage entirely in DRAM , 2010, OPSR.

[93]  Peter Steenkiste A systematic approach to host interface design for high-speed networks , 1994, Computer.

[94]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[95]  Sayantan Sur,et al.  Early Evaluation of Scalable Fabric Interface for PGAS Programming Models , 2014, PGAS.

[96]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[97]  David Flynn,et al.  DFS: A file system for virtualized flash storage , 2010, TOS.

[98]  Ricardo Bianchini,et al.  The MIT Alewife machine: architecture and performance , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[99]  Amin Vahdat,et al.  Chronos: predictable low latency for data center applications , 2012, SoCC '12.

[100]  K. K. Ramakrishnan,et al.  Eliminating receive livelock in an interrupt-driven kernel , 1996, TOCS.

[101]  Brian Zill,et al.  Protocol implementation on the Nectar Communication Processor , 1990, SIGCOMM 1990.

[102]  Philippe Bonnet,et al.  I/O Speculation for the Microsecond Era , 2014, USENIX Annual Technical Conference.

[103]  Jeffrey S. Chase,et al.  End system optimizations for high-speed TCP , 2001, IEEE Commun. Mag..

[104]  David Woodhouse,et al.  JFFS : The Journalling Flash File System , 2001 .

[105]  Jon Howell,et al.  Flat Datacenter Storage , 2012, OSDI.

[106]  Michael Burrows,et al.  Performance of Firefly RPC , 1990, ACM Trans. Comput. Syst..

[107]  David R. Cheriton,et al.  Software-Controlled Caches in the VMP Multiprocessor , 1986, ISCA.

[108]  Eric A. Brewer,et al.  Remote queues: exposing message queues for optimization and atomicity , 1995, SPAA '95.

[109]  John Wilkes Hamlyn — an interface for sender- based communications , 1992 .

[110]  Edoardo Biagioni A structured TCP in standard ML. , 1994, SIGCOMM 1994.

[111]  John K. Ousterhout,et al.  Why Aren't Operating Systems Getting Faster As Fast as Hardware? , 1990, USENIX Summer.

[112]  Steven Swanson,et al.  Refactor, Reduce, Recycle: Restructuring the I/O Stack for the Future of Storage , 2013, Computer.

[113]  Renato Recio,et al.  A Remote Direct Memory Access Protocol Specification , 2007, RFC.

[114]  Philippe Bonnet,et al.  Linux block IO: introducing multi-queue SSD access on multi-core systems , 2013, SYSTOR '13.

[115]  Rajesh K. Gupta,et al.  NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories , 2011, ASPLOS XVI.

[116]  Calton Pu,et al.  High Performance Sockets and RPC over Virtual Interface (VI) Architecture , 1999, CANPC.

[117]  Alan L. Cox,et al.  An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems , 2006, USENIX Annual Technical Conference, General Track.

[118]  Peter Desnoyers,et al.  Analytic Models of SSD Write Performance , 2014, TOS.

[119]  Margo I. Seltzer,et al.  Making the Most Out of Direct-Access Network Attached Storage , 2003, FAST.

[120]  Ali G. Saidi,et al.  Integrated network interfaces for high-bandwidth TCP/IP , 2006, ASPLOS XII.

[121]  Mendel Rosenblum,et al.  Fast crash recovery in RAMCloud , 2011, SOSP.

[122]  Erich M. Nahum,et al.  Cache behavior of network protocols , 1997, SIGMETRICS '97.

[123]  Paul E. McKenney,et al.  Efficient demultiplexing of incoming TCP packets , 1992, SIGCOMM 1992.

[124]  P. Druschel,et al.  Soft timers: efficient microsecond software timer support for network processing , 2000, OPSR.

[125]  Rajesh K. Gupta,et al.  Onyx: A Prototype Phase Change Memory Storage Array , 2011, HotStorage.

[126]  James Pinkerton,et al.  Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security , 2007, RFC.

[127]  D. R. Cheriton,et al.  VMTP: Versatile Message Transaction Protocol , 1988 .

[128]  Robert Tappan Morris,et al.  Improving network connection locality on multicore systems , 2012, EuroSys '12.

[129]  Christoforos E. Kozyrakis,et al.  IX: A Protected Dataplane Operating System for High Throughput and Low Latency , 2014, OSDI.

[130]  Terence Kelly,et al.  Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data , 2013, EuroSys '13.

[131]  Evangelos P. Markatos,et al.  Speeding up TCP/IP: faster processors are not enough , 2002, Conference Proceedings of the IEEE International Performance, Computing, and Communications Conference (Cat. No.02CH37326).

[132]  Sylvia Ratnasamy,et al.  SoftNIC: A Software NIC to Augment Hardware , 2015 .

[133]  Luis Ceze,et al.  Operating System Implications of Fast, Cheap, Non-Volatile Memory , 2011, HotOS.

[134]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.

[135]  Hemal Shah,et al.  DA: Datamover Architecture for the Internet Small Computer System Interface (iSCSI) , 2007, RFC.

[136]  Eunyoung Jeong,et al.  mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.

[137]  Brian N. Bershad,et al.  An Extensible Protocol Architecture for Application-Specific Networking , 1996, USENIX Annual Technical Conference.

[138]  Michael M. Swift,et al.  FlashVM: Revisiting the Virtual Memory Hierarchy , 2009, HotOS.

[139]  Michael M. Swift,et al.  FlashTier: a lightweight, consistent and durable storage cache , 2012, EuroSys '12.

[140]  Haralampos Pozidis,et al.  Trends in Storage Technologies , 2010, IEEE Data Eng. Bull..

[141]  Babak Falsafi,et al.  Coherent Network Interfaces for Fine-Grain Communication , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[142]  Robert B. Ross,et al.  Distributing the Data Plane for Remote Storage Access , 2015, HotOS.

[143]  Hiroshi Motoda,et al.  A Flash-Memory Based File System , 1995, USENIX.

[144]  Henry M. Levy,et al.  Limits to low-latency communication on high-speed networks , 1993, TOCS.

[145]  Renato John Recio Server I/O networks past, present, and future , 2003, NICELI '03.

[146]  Wolfgang Rehm,et al.  Providing a High-Performance VIA-Module for LAM/MPI , 2004 .

[147]  Mendel Rosenblum,et al.  Network Interface Design for Low Latency Request-Response Protocols , 2013, USENIX ATC.

[148]  Frank Hady,et al.  When poll is better than interrupt , 2012, FAST.

[149]  Michael Stumm,et al.  Exception-Less System Calls for Event-Driven Servers , 2011, USENIX Annual Technical Conference.

[150]  Michael Wu,et al.  eNVy: a non-volatile, main memory storage system , 1994, ASPLOS VI.

[151]  Babak Falsafi,et al.  Manycore Network Interfaces for in-memory rack-scale computing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[152]  Paolo Faraboschi,et al.  Operating System Support for NVM+DRAM Hybrid Main Memory , 2009, HotOS.

[153]  Richard F. Rashid,et al.  The Integration of Virtual Memory Management and Interprocess Communication in Accent , 1986, ACM Trans. Comput. Syst..

[154]  Liviu Iftode,et al.  Software support for virtual memory-mapped communication , 1996, Proceedings of International Conference on Parallel Processing.

[155]  David J. Lilja,et al.  High performance solid state storage under Linux , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[156]  Jim Zelenka,et al.  A cost-effective, high-bandwidth storage architecture , 1998, ASPLOS VIII.

[157]  A. L. Narasimha Reddy,et al.  SCMFS: A file system for Storage Class Memory , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[158]  Scott Shenker,et al.  Making Sense of Performance in Data Analytics Frameworks , 2015, NSDI.

[159]  Gustavo Alonso,et al.  Server-efficient high-definition media dissemination , 2009, NOSSDAV '09.

[160]  Dhabaleswar K. Panda,et al.  Sockets Direct Protocol over InfiniBand in clusters: is it beneficial? , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.

[161]  Robert B. Ross,et al.  On the role of burst buffers in leadership-class storage systems , 2012, 012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[162]  Brian N. Bershad,et al.  Protocol service decomposition for high-performance networking , 1994, SOSP '93.

[163]  Jeffrey S. Chase,et al.  Trapeze / IP : TCP / IP at Near-Gigabit Speeds , 1999 .

[164]  Animesh Trivedi,et al.  DaRPC: Data Center RPC , 2014, SoCC.

[165]  David D. Clark,et al.  An analysis of TCP processing overhead , 1988, IEEE Communications Magazine.

[166]  Bruce Jacob,et al.  The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization , 2009, ISCA '09.

[167]  Alessandro Curioni,et al.  Rebasing I/O for Scientific Computing: Leveraging Storage Class Memory in an IBM BlueGene/Q Supercomputer , 2014, ISC.

[168]  W. Daniel Hillis,et al.  The network architecture of the Connection Machine CM-5 (extended abstract) , 1992, SPAA '92.

[169]  H. T. Kung,et al.  A Host Interface Architecture for High-Speed Networks , 1992, HPN.

[170]  Byung-Gon Chun,et al.  Usenix Association 10th Usenix Symposium on Operating Systems Design and Implementation (osdi '12) 135 Megapipe: a New Programming Interface for Scalable Network I/o , 2022 .

[171]  William I. Nowicki,et al.  NFS: Network File System Protocol specification , 1989, RFC.

[172]  Mendel Rosenblum,et al.  It's Time for Low Latency , 2011, HotOS.

[173]  Richard W. Watson,et al.  Gaining efficiency in transport services by appropriate design and implementation choices , 1987, TOCS.

[174]  Willy Zwaenepoel,et al.  The peregrine high‐performance RPC system , 1993, Softw. Pract. Exp..

[175]  Thomas E. Anderson,et al.  FlexNIC: Rethinking Network DMA , 2015, HotOS.

[176]  신웅 OS I/O path optimizations for flash solid-state drives , 2017 .

[177]  Erich M. Nahum,et al.  Server Network Scalability and TCP Offload , 2005, USENIX Annual Technical Conference, General Track.

[178]  Haixun Wang,et al.  Trinity: a distributed graph engine on a memory cloud , 2013, SIGMOD '13.

[179]  Joseph Pasquale,et al.  Profiling and reducing processing overheads in TCP/IP , 1996, TNET.

[180]  Peter Druschel,et al.  Cache and TLB Effectiveness in the Processing of Network Data , 1993 .

[181]  Thomas R. Gross,et al.  Unified High-Performance I/O: One Stack to Rule Them All , 2013, HotOS.

[182]  Christopher Frost,et al.  Better I/O through byte-addressable, persistent memory , 2009, SOSP '09.

[183]  David G. Andersen,et al.  Using vector interfaces to deliver millions of IOPS from a networked key-value storage server , 2012, SoCC '12.

[184]  Abhishek Verma,et al.  Large-scale cluster management at Google with Borg , 2015, EuroSys.

[185]  Charles L. Seitz,et al.  Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.

[186]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[187]  Eric Anderson,et al.  Efficiency matters! , 2010, OPSR.

[188]  Jim Zelenka,et al.  File server scaling with network-attached secure disks , 1997, SIGMETRICS '97.

[189]  H. T. Kung,et al.  The design of nectar: a network backplane for heterogeneous multicomputers , 1989, ASPLOS III.

[190]  Larry L. Peterson,et al.  Fbufs: a high-bandwidth cross-domain transfer facility , 1994, SOSP '93.

[191]  Steve Scott,et al.  Performance of the CRAY T3E Multiprocessor , 1997, SC.

[192]  Kai Li,et al.  Protected, user-level DMA for the SHRIMP network interface , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[193]  Joo Young Hwang,et al.  F2FS: A New File System for Flash Storage , 2015, FAST.

[194]  Dawson R. Engler,et al.  Exokernel: an operating system architecture for application-level resource management , 1995, SOSP.

[195]  Jeffrey C. Mogul Network Locality at the Scale of Processes , 1992, ACM Trans. Comput. Syst..

[196]  Pavan Balaji,et al.  Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck , 2004 .

[197]  Michael Stumm,et al.  FlexSC: Flexible System Call Scheduling with Exception-Less System Calls , 2010, OSDI.

[198]  William Gropp,et al.  Learning from the Success of MPI , 2001, HiPC.

[199]  Gustavo Alonso,et al.  Minimizing the Hidden Cost of RDMA , 2009, 2009 29th IEEE International Conference on Distributed Computing Systems.

[200]  Charlie Johnson,et al.  IBM Power Edge of Network Processor: A Wire-Speed System on a Chip , 2011, IEEE Micro.

[201]  Trevor Blackwell Speeding up protocols for small messages , 1996, SIGCOMM 1996.

[202]  Muli Ben-Yehuda,et al.  IsoStack - Highly Efficient Network Processing on Dedicated Cores , 2010, USENIX Annual Technical Conference.

[203]  Eran Gabber,et al.  The Case Against User-Level Networking , 2004 .

[204]  Hemal Shah,et al.  Remote Direct Memory Access (RDMA) Protocol Extensions , 2014, RFC.

[205]  Michael M. Swift,et al.  Mnemosyne: lightweight persistent memory , 2011, ASPLOS XVI.

[206]  P. Pierce,et al.  The Paragon implementation of the NX message passing interface , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[207]  David R. Cheriton,et al.  The VMP network adapter board (NAB): high-performance network communication for multiprocessors , 1988, SIGCOMM 1988.

[208]  Thu D. Nguyen,et al.  Implementing network protocols at user level , 1993, TNET.

[209]  Dana S. Henry,et al.  A tightly-coupled processor-network interface , 1992, ASPLOS V.

[210]  José Carlos Brustoloni,et al.  Effects of buffering semantics on I/O performance , 1996, OSDI '96.

[211]  David E. Culler,et al.  High-performance local area communication with fast sockets , 1997 .

[212]  Larry L. Peterson,et al.  The x-Kernel: An Architecture for Implementing Network Protocols , 1991, IEEE Trans. Software Eng..

[213]  Brian Zill,et al.  Software support for outboard buffering and checksumming , 1995, SIGCOMM '95.

[214]  Christopher R. Johnson,et al.  PIKA: A Network Service for Multikernel Operating Systems , 2014 .

[215]  Brent Callaghan,et al.  NFS over RDMA , 2003, NICELI '03.

[216]  Larry L. Peterson,et al.  Design of the x-kernel , 1988, SIGCOMM '88.

[217]  Kai Li,et al.  Storage alternatives for mobile computers , 1994, OSDI '94.

[218]  Irfan Ahmad,et al.  vIC: Interrupt Coalescing for Virtual Machine Storage Device IO , 2011, USENIX Annual Technical Conference.

[219]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[220]  Peter F. Corbett,et al.  The Direct Access File System , 2003, FAST.

[221]  Henry M. Levy,et al.  Separating data and control transfer in distributed operating systems , 1994, ASPLOS VI.

[222]  Bingsheng He,et al.  NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems , 2015, FAST.

[223]  A. Gupta,et al.  The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[224]  Rina Panigrahy,et al.  Design Tradeoffs for SSD Performance , 2008, USENIX ATC.

[225]  Hsiao-Keng Jerry Chu,et al.  Zero-Copy TCP in Solaris , 1996, USENIX Annual Technical Conference.

[226]  Aled Edwards,et al.  Experiences implementing a high performance TCP in user-space , 1995, SIGCOMM '95.

[227]  Torsten Hoefler,et al.  Remote Memory Access Programming in MPI-3 , 2015, TOPC.

[228]  David D. Clark,et al.  The structuring of systems using upcalls , 1985, SOSP '85.

[229]  Anthony Skjellum,et al.  Design, implementation, and performance evaluation of MPI 3.0 on portals 4.0 , 2013, EuroMPI.

[230]  Heon Young Yeom,et al.  Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory , 2013, HotStorage.

[231]  Steven Swanson,et al.  Providing safe, user space access to fast, solid state disks , 2012, ASPLOS XVII.

[232]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[233]  Katherine Yelick,et al.  Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT , 2009 .

[234]  Jian Xu,et al.  Bankshot: caching slow storage in fast non-volatile memory , 2013, INFLOW '13.

[235]  Sayantan Sur,et al.  A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency , 2015, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects.

[236]  Randall R. Stewart,et al.  Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation , 2007, RFC.

[237]  William J. Dally,et al.  The J-machine Multicomputer: An Architectural Evaluation , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[238]  Peter Druschel,et al.  Experiences with a high-speed network adaptor: a software perspective , 1994, SIGCOMM 1994.

[239]  Ian Watson,et al.  The Manchester prototype dataflow computer , 1985, CACM.

[240]  Mark Handley,et al.  Network stack specialization for performance , 2015, SIGCOMM 2015.

[241]  Larry L. Peterson,et al.  A language-based approach to protocol implementation , 1993, TNET.

[242]  David G. Andersen,et al.  Using RDMA efficiently for key-value services , 2015, SIGCOMM 2015.

[243]  Timothy Roscoe,et al.  Modeling NICs with Unicorn , 2013, PLOS '13.

[244]  Richard P. Martin,et al.  Effects Of Communication Latency, Overhead, And Bandwidth In A Cluster Architecture , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[245]  Arkady Kanevsky,et al.  Enhanced Remote Direct Memory Access (RDMA) Connection Establishment , 2012, RFC.

[246]  Thomas L. Sterling,et al.  BEOWULF: A Parallel Workstation for Scientific Computation , 1995, ICPP.

[247]  Jeffrey S. Chase,et al.  On the elusive benefits of protocol offload , 2003, NICELI '03.

[248]  Thomas F. Wenisch,et al.  Thin servers with smart pipes: designing SoC accelerators for memcached , 2013, ISCA.

[249]  Evangelos Eleftheriou,et al.  Container Marking: Combining Data Placement, Garbage Collection and Wear Levelling for Flash , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[250]  Michael M. Swift,et al.  Storage-class memory needs flexible interfaces , 2013, APSys.

[251]  Ashish Gupta,et al.  The RAMCloud Storage System , 2015, ACM Trans. Comput. Syst.

[252]  Joel Dylan Coburn.  Providing fast and safe access to next-generation, non-volatile memories , 2012 .

[253]  Todor I. Mollov,et al.  Quill: Exploiting Fast Non-Volatile Memory by Transparently Bypassing the File System , 2013 .

[254]  K. K. Ramakrishnan,et al.  Performance Considerations in Designing Network Interfaces , 1993, IEEE J. Sel. Areas Commun.

[255]  Adrian Schüpbach,et al.  The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[256]  Joseph Pasquale,et al.  The importance of non-data touching processing overheads in TCP/IP , 1993, SIGCOMM 1993.

[257]  David E. Culler,et al.  An Implementation and Analysis of the Virtual Interface Architecture , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[258]  Eddie Kohler,et al.  A readable TCP in the Prolac protocol language , 1999, SIGCOMM '99.

[259]  George Bosilca,et al.  UCX: An Open Source Framework for HPC Network APIs and Beyond , 2015, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects.

[260]  Steven Swanson,et al.  Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications , 2009, ASPLOS.

[261]  Sanjay Ghemawat,et al.  The Google file system , 2003 .

[262]  Ren-Shuo Liu,et al.  NVM duet: unified working memory and persistent store architecture , 2014, ASPLOS.

[263]  Robert Grimm,et al.  Application performance and flexibility on exokernel systems , 1997, SOSP.

[264]  Andrea C. Arpaci-Dusseau,et al.  De-indirection for flash-based SSDs with nameless writes , 2012, FAST.

[265]  Jeffrey C. Mogul,et al.  The packet filter: an efficient mechanism for user-level network code , 1987, SOSP '87.

[266]  Orion Hodson,et al.  Whole-system persistence , 2012, ASPLOS XVII.

[267]  Sayantan Sur,et al.  Memcached Design on High Performance RDMA Capable Interconnects , 2011, 2011 International Conference on Parallel Processing.

[268]  F. Bitz,et al.  Host interface design for ATM LANs , 1991, [1991] Proceedings 16th Conference on Local Computer Networks.

[269]  Jian Yang,et al.  Mojim: A Reliable and Highly-Available Non-Volatile Memory System , 2015, ASPLOS.

[270]  Babak Falsafi,et al.  Scale-out NUMA , 2014, ASPLOS.

[271]  Harry Rudin,et al.  A Survey of Light-Weight Protocols for High-Speed Networks , 1994 .

[272]  Jeffrey C. Mogul,et al.  The effect of context switches on cache performance , 1991, ASPLOS IV.

[273]  Ram Huggahalli,et al.  Direct cache access for high bandwidth network I/O , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[274]  Timothy Roscoe,et al.  Arrakis , 2014, OSDI.

[275]  Dhabaleswar K. Panda,et al.  Beyond block I/O: Rethinking traditional storage primitives , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[276]  Jens Teubner,et al.  A Spinning Join That Does Not Get Dizzy , 2010, 2010 IEEE 30th International Conference on Distributed Computing Systems.

[277]  David G. Andersen,et al.  The Case for VOS: The Vector Operating System , 2011, HotOS.

[278]  Hyeonsang Eom,et al.  Optimizing the Block I/O Subsystem for Fast Storage Devices , 2014, ACM Trans. Comput. Syst.

[279]  Dhruva R. Chakrabarti,et al.  Implications of CPU Caching on Byte-addressable Non-Volatile Memory Programming , 2012 .

[280]  Qin Jin,et al.  Persistent B+-Trees in Non-Volatile Main Memory , 2015, Proc. VLDB Endow.

[281]  Andrea C. Arpaci-Dusseau,et al.  ANViL: Advanced Virtualization for Modern Non-Volatile Memory Devices , 2015, FAST.

[282]  Antony I. T. Rowstron,et al.  IOFlow: a software-defined storage architecture , 2013, SOSP.

[283]  Dahlia Malkhi,et al.  CORFU: A Shared Log Design for Flash Clusters , 2012, NSDI.

[284]  Philip Werner Frey,et al.  Zero-copy network communication: An applicability study of iWARP beyond micro benchmarks , 2010 .

[285]  Jinyang Li,et al.  Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store , 2013, USENIX ATC.