Redefining Data Locality for Cross-Data Center Storage

Many cloud applications exploit the diversity of storage options in a data center to achieve desired cost, performance, and durability tradeoffs. It is common to see applications combining memory, local disk, and archival storage tiers within a single data center to meet their needs. On Amazon, for example, hot data can be kept in memory using ElastiCache, while colder data is placed in cheaper, slower storage such as S3. For user-facing applications, a recent trend is to place data across multiple data centers to reduce the latency between users and their data. The conventional wisdom, however, is that co-locating computation and storage within the same data center is key to application performance, so applications running within a data center are often still limited to accessing local data. In this paper, using experiments on the Amazon, Microsoft, and Google clouds, we show that this assumption is false: accessing data in nearby data centers can be faster than local access, at different, or even the same, tiers of the storage hierarchy. Exploiting this fact can yield not only better performance but also reduced cost and simpler consistency policies, and it calls for reconsidering data locality in multi-data-center environments. This argues for expanding cloud storage tiers to include non-local storage options, with interesting implications for the design of distributed storage systems.
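
To make the local-versus-nearby comparison concrete, below is a minimal Python sketch that times HTTPS round trips from the current machine to a few S3 regional endpoints. The regions listed are illustrative assumptions, not the paper's experimental setup; the paper's measurements issue real object GETs and PUTs from VMs inside each cloud, so treat this only as a rough probe of inter-data-center distance, using just the standard library.

```python
# Minimal latency probe (sketch): time HTTPS round trips to S3 regional
# endpoints. Each sample includes connection and TLS setup, so it measures
# a full round trip rather than pure object-access latency. Any HTTP status
# (e.g., 403 for an unauthenticated HEAD) still completes the round trip.
import http.client
import statistics
import time

# Hypothetical region choices: substitute the region "local" to your VM
# and a few nearby ones.
ENDPOINTS = [
    "s3.us-east-1.amazonaws.com",
    "s3.us-east-2.amazonaws.com",
    "s3.us-west-2.amazonaws.com",
]

def probe(host: str, samples: int = 5) -> float:
    """Return the median round-trip time (ms) for a HEAD request to host."""
    times = []
    for _ in range(samples):
        conn = http.client.HTTPSConnection(host, timeout=10)
        start = time.perf_counter()
        conn.request("HEAD", "/")
        conn.getresponse().read()
        times.append((time.perf_counter() - start) * 1000)
        conn.close()
    return statistics.median(times)

if __name__ == "__main__":
    for host in ENDPOINTS:
        print(f"{host}: {probe(host):.1f} ms")
```

Run from a VM in one region, the printed medians give a first-order picture of which remote regions are "nearby" enough that their storage tiers could compete with local ones.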
