Mitigating high latency outliers for cloud-based telecommunication services

Telecommunication applications are distinguished by their stringent requirements for availability and completion times. A highly available, low-latency, distributed data store is therefore a critical component of cloud-based realizations of telecommunication services. We present a systematic experimental evaluation of state-of-the-art database systems as components of telecommunication applications. We show that while their average latencies are well within the required time scales, the distribution of latencies exhibits a long tail of unacceptably large outliers that may significantly impair the ability to meet the performance requirements of telecommunication applications. To address these high-latency outliers, we present a new solution implemented in a Bell Labs system code-named Flurry. Flurry uses the first response from a replica rather than waiting for responses from all replicas or from a quorum. To handle incorrect responses arising from message losses, Flurry uses a novel checking algorithm based on vector clocks to determine the correctness of a replica's response. Our experimental results show that Flurry significantly reduces both the average response time and the probability of unacceptable response times, to values that would allow meeting the availability and completion time thresholds required for telecommunication services. © 2012 Alcatel-Lucent.
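The abstract does not spell out Flurry's checking algorithm, so the following is only a minimal Python sketch of the general idea: query every replica in parallel and take the first response that passes a vector-clock check. The acceptance rule used here (the response's clock must be at least as recent as the clock the client last observed) is an assumption made for illustration, not the paper's algorithm, and the names `dominates`, `first_acceptable`, and `make_replica` are hypothetical.

```python
import concurrent.futures as cf
import time


def dominates(a, b):
    """Return True if vector clock `a` is >= `b` in every component of `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def first_acceptable(replicas, key, last_seen, timeout=1.0):
    """Query all replicas in parallel and return the first (value, clock)
    whose clock is not stale relative to `last_seen`.

    Assumed rule, not taken from the paper: a response is accepted when
    its vector clock dominates the clock the client last observed.
    Returns None if no replica gives an acceptable answer in time.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, key) for replica in replicas]
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            try:
                value, clock = fut.result()
            except Exception:
                continue  # a failed replica is skipped, not fatal
            if dominates(clock, last_seen):
                return value, clock  # do not wait for slower replicas
        return None
    except cf.TimeoutError:
        return None
    finally:
        pool.shutdown(wait=False)  # let slow replicas finish in the background


def make_replica(delay, value, clock):
    """Hypothetical stand-in for a replica: answers after `delay` seconds."""
    def read(key):
        time.sleep(delay)
        return value, dict(clock)
    return read


# Usage: one fast but stale replica, one up-to-date replica, one slow replica.
replicas = [
    make_replica(0.0, "old", {"a": 1}),
    make_replica(0.05, "new", {"a": 2, "b": 1}),
    make_replica(0.3, "new", {"a": 2, "b": 1}),
]
print(first_acceptable(replicas, "subscriber:42", last_seen={"a": 2}))
```

In this sketch the stale answer arrives first but is rejected by the clock check, and the caller returns as soon as the second replica answers, without paying the latency of the slowest one. A quorum read, by contrast, would wait for a majority of the replicas before returning.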
