Web-Scale Data Management for the Cloud

The efficient management of a consistent and integrated database is a central task in modern IT and highly relevant for science and industry. Hardly any critical enterprise solution comes without functionality for managing data in its different forms. Web-Scale Data Management for the Cloud addresses fundamental challenges posed by the need and desire to provide database functionality in the context of the Database as a Service (DBaaS) paradigm for database outsourcing. The book also discusses the motivation behind the new paradigm of cloud computing and its impact on data outsourcing and service-oriented computing in data-intensive applications. It includes a survey of the techniques and special requirements for building database services. The last section covers the techniques supported by current cloud environments, along with major challenges and future trends.
