Where in the world is my data?

Users of websites such as Facebook, eBay and Yahoo! demand fast response times, and these sites replicate data across globally distributed datacenters to achieve this. However, it is not necessary to replicate all data to all locations: if a European user's record is never accessed in Asia, it does not make sense to pay the bandwidth and disk costs to maintain an Asian replica. In this paper, we describe mechanisms for selectively replicating large-scale web databases on a record-by-record basis. We introduce a flexible constraint language for specifying replication policy constraints. We then present an adaptive scheme that replicates data to where it is most frequently accessed, while respecting policy constraints and using minimal bookkeeping. Experiments using a modified version of our PNUTS system demonstrate that our techniques work well.
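
To make the idea of per-record selective replication concrete, the following is a minimal sketch in Python. It is not the constraint language or the adaptive algorithm of the paper, whose details are not given in this abstract. The `Policy` fields (`min_copies`, `required`, `forbidden`), the function `choose_replicas`, the region names, and the read-count `threshold` are all hypothetical names introduced only for illustration. The sketch assumes that each record carries a small per-region read counter and that a policy can require regions, forbid regions, and set a minimum number of copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-record replication policy (illustrative only)."""
    min_copies: int = 1                  # minimum number of full replicas
    required: frozenset = frozenset()    # regions that must hold a replica
    forbidden: frozenset = frozenset()   # regions that must not hold a replica


def choose_replicas(reads, policy, all_regions, threshold):
    """Pick the regions that should hold a full replica of one record.

    reads maps region -> recent read count for this record (assumed input).
    A region is chosen if the policy requires it, or if it is allowed and
    its read count reaches the threshold. If that leaves fewer than
    min_copies replicas, the most-read allowed regions are added.
    """
    allowed = [r for r in all_regions if r not in policy.forbidden]
    chosen = set(policy.required)

    # Replicate to every allowed region where the record is read often enough.
    for region in allowed:
        if reads.get(region, 0) >= threshold:
            chosen.add(region)

    # Top up to the minimum copy count, preferring the most-read regions
    # (ties broken by region name so the result is deterministic).
    for region in sorted(allowed, key=lambda r: (-reads.get(r, 0), r)):
        if len(chosen) >= policy.min_copies:
            break
        chosen.add(region)

    return chosen


# A European user's record: read heavily in "eu", rarely in "us", never in "asia".
regions = ["asia", "eu", "us"]
reads = {"eu": 120, "us": 15, "asia": 0}
replicas = choose_replicas(reads, Policy(min_copies=2), regions, threshold=50)
print(sorted(replicas))
```

Under these assumptions the record is kept in the region that reads it most, plus one further region to satisfy the minimum copy count, and no Asian replica is created. A real system would also need to decide how reads in regions without a replica are served and when counters are reset, which this sketch leaves out.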
