FaRM: Fast Remote Memory

We describe the design and implementation of FaRM, a new main memory distributed computing platform that exploits RDMA to improve both latency and throughput by an order of magnitude relative to state of the art main memory systems that use TCP/IP. FaRM exposes the memory of machines in the cluster as a shared address space. Applications can use transactions to allocate, read, write, and free objects in the address space with location transparency. We expect this simple programming model to be sufficient for most application code. FaRM provides two mechanisms to improve performance where required: lock-free reads over RDMA, and support for collocating objects and function shipping to enable the use of efficient single machine transactions. FaRM uses RDMA both to directly access data in the shared address space and for fast messaging and is carefully tuned for the best RDMA performance. We used FaRM to build a key-value store and a graph store similar to Facebook's. They both perform well, for example, a 20-machine cluster can perform 167 million key-value lookups per second with a latency of 31µs.

[1]  J. T. Robinson,et al.  On optimistic methods for concurrency control , 1979, TODS.

[2]  Ravi Sethi,et al.  Useless Actions Make a Difference: Strict Serializability of Database Updates , 1982, JACM.

[3]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[4]  Andreas Reuter,et al.  Transaction Processing: Concepts and Techniques , 1992 .

[5]  David R. Cheriton,et al.  The synergy between non-blocking synchronization and operating system structure , 1996, OSDI '96.

[6]  Alan L. Cox,et al.  ThreadMarks: Shared Memory Computing on Networks of Workstations , 1996, Computer.

[7]  David R. Karger,et al.  Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web , 1997, STOC '97.

[8]  Srinivasan Parthasarathy,et al.  Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.

[9]  Katherine Yelick,et al.  Titanium: a high-performance Java dialect , 1998 .

[10]  Katherine A. Yelick,et al.  Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..

[11]  Kathryn S. McKinley,et al.  Hoard: a scalable memory allocator for multithreaded applications , 2000, SIGP.

[12]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[13]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[14]  Dhabaleswar K. Panda,et al.  High Performance RDMA-Based MPI Implementation over InfiniBand , 2003, ICS '03.

[15]  Dhabaleswar K. Panda,et al.  High performance RDMA-based MPI implementation over InfiniBand , 2003, ICS.

[16]  Dhabaleswar K. Panda,et al.  PVFS over InfiniBand: design and performance evaluation , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[17]  Keir Fraser,et al.  Practical lock-freedom , 2003 .

[18]  Anjali Gupta,et al.  Efficient Routing for Peer-to-Peer Overlays , 2004, NSDI.

[19]  Dhabaleswar K. Panda,et al.  Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[20]  Dhabaleswar K. Panda,et al.  High performance implementation of MPI derived datatype communication over InfiniBand , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[21]  Rasmus Pagh,et al.  Cuckoo Hashing , 2001, Encyclopedia of Algorithms.

[22]  W. Zwaenepoel,et al.  Shared Memory Computing on Networks of Workstations , 2004 .

[23]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[24]  Galen M. Shipman,et al.  Infiniband scalability in Open MPI , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[25]  Bradford L. Chamberlain,et al.  Parallel Programmability and the Chapel Language , 2007, Int. J. High Perform. Comput. Appl..

[26]  C. H. Flood,et al.  The Fortress Language Specification , 2007 .

[27]  Marcos K. Aguilera,et al.  Sinfonia: a new paradigm for building scalable distributed systems , 2007, SOSP.

[28]  Maurice Herlihy,et al.  Hopscotch Hashing , 2008, DISC.

[29]  Gadi Taubenfeld Proceedings of the 22nd international symposium on Distributed Computing , 2008 .

[30]  Dan Meng,et al.  Early Experiences with Write-Write Design of NFS over RDMA , 2009, 2009 IEEE International Conference on Networking, Architecture, and Storage.

[31]  Dilma Da Silva,et al.  Providing a cloud network infrastructure on a supercomputer , 2010, HPDC '10.

[32]  Surajit Chaudhuri,et al.  Proceedings of the 11th ACM Symposium on Cloud Computing , 2010 .

[33]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[34]  Mahadev Konar,et al.  ZooKeeper: Wait-free Coordination for Internet-scale Systems , 2010, USENIX Annual Technical Conference.

[35]  Nir Shavit,et al.  Flat combining and the synchronization-parallelism tradeoff , 2010, SPAA '10.

[36]  Sayantan Sur,et al.  Memcached Design on High Performance RDMA Capable Interconnects , 2011, 2011 International Conference on Parallel Processing.

[37]  Mendel Rosenblum,et al.  Fast crash recovery in RAMCloud , 2011, SOSP.

[38]  Dhabaleswar K. Panda,et al.  Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[39]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[40]  Dhabaleswar K. Panda,et al.  Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[41]  Dhabaleswar K. Panda,et al.  High-Performance Design of HBase with RDMA over InfiniBand , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[42]  Dhabaleswar K. Panda,et al.  High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[43]  Tony Tung,et al.  Scaling Memcache at Facebook , 2013, NSDI.

[44]  Bin Fan,et al.  MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing , 2013, NSDI.

[45]  Timothy G. Armstrong,et al.  LinkBench: a database benchmark based on the Facebook social graph , 2013, SIGMOD '13.

[46]  Hui Ding,et al.  TAO: Facebook's Distributed Data Store for the Social Graph , 2013, USENIX Annual Technical Conference.

[47]  Jinyang Li,et al.  Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store , 2013, USENIX Annual Technical Conference.

[48]  Torsten Hoefler,et al.  Remote Memory Access Programming in MPI-3 , 2015, TOPC.