FAST CLOUD: Pushing the Envelope on Delay Performance of Cloud Storage With Coding

This paper presents solutions that can significantly improve the delay performance of storing and retrieving data in cloud storage. We first measure the delay performance of a popular cloud storage service, Amazon S3. We establish that there is significant randomness in service times for reading and writing small and medium-size objects when they are assigned distinct keys. We further demonstrate that using erasure coding, parallel connections to the storage cloud, and limited chunking (i.e., dividing the object into a few smaller objects) together pushes the envelope on service time distributions significantly (e.g., 76%, 80%, and 85% reductions in the mean, 90th percentile, and 99th percentile, respectively, for 2-MB files) at the expense of additional storage (e.g., 1.75$\times$). However, chunking and erasure coding increase the load and hence the queuing delays, while reducing the supportable rate region in number of requests per second per node. Thus, in the second part of the paper, we analyze the delay performance when chunking, forward error correction (FEC), and parallel connections are used together. Based on this analysis, we develop load-adaptive algorithms that pick the best code rate on a per-request basis by using queue backlog thresholds computed offline. The solutions work with homogeneous services with fixed object sizes, chunk sizes, and operation types (e.g., read or write), as well as heterogeneous services with a mixture of object sizes, chunk sizes, and operation types. We also present a simple greedy solution that opportunistically uses idle connections and picks the erasure coding rate accordingly on the fly. Both the backlog-based and greedy solutions support the full rate region and provide the best mean delay performance when compared to the best fixed coding rate policy. Our evaluations show that the backlog-based solutions achieve better delay performance at higher percentile values than the greedy solution.
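To make the two ideas above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes an object split into k chunks and encoded into n coded chunks, any k of which suffice to reconstruct it, so a request completes when the k-th fastest of n parallel transfers finishes. The function names, the exponential service-time model, the threshold values, and the (n, k) choices are all illustrative assumptions; n = 7 and k = 4 are used only because they give a 1.75x storage overhead.

```python
import bisect
import random


def completion_time(chunk_delays, k):
    """Time at which any k of the n parallel chunk transfers have finished."""
    if not 1 <= k <= len(chunk_delays):
        raise ValueError("need 1 <= k <= n")
    return sorted(chunk_delays)[k - 1]


def pick_code_length(backlog, thresholds, lengths):
    """Choose n (total coded chunks) from the current queue backlog.

    thresholds: ascending backlog cut points (hypothetical values; the
                abstract only says such thresholds are computed offline).
    lengths:    len(thresholds) + 1 candidate values of n, ordered from
                most redundant (short queue) to least redundant (long queue).
    """
    if len(lengths) != len(thresholds) + 1:
        raise ValueError("need exactly one more length than thresholds")
    return lengths[bisect.bisect_right(thresholds, backlog)]


def simulate_mean_delay(k, n, trials=10000, seed=0):
    """Mean completion time under an ASSUMED i.i.d. exponential(1) chunk delay.

    This ignores queueing and is not the measured S3 distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        delays = [rng.expovariate(1.0) for _ in range(n)]
        total += completion_time(delays, k)
    return total / trials


if __name__ == "__main__":
    k = 4
    # More redundancy when the queue is short, none when it is long.
    for backlog in (0, 3, 9):
        n = pick_code_length(backlog, thresholds=[2, 5], lengths=[7, 5, 4])
        print(f"backlog={backlog} -> (n={n}, k={k}), storage overhead {n / k:.2f}x")
    print("mean delay, no coding (4,4):", simulate_mean_delay(4, 4))
    print("mean delay, coded     (7,4):", simulate_mean_delay(4, 7))
```

Under the assumed delay model, the coded configuration finishes sooner on average because it waits only for the fastest 4 of 7 transfers. The threshold rule reflects the trade-off stated above: extra coded chunks add load, so redundancy is reduced as the backlog grows.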
