Paper Information - Online Expansion of Largescale Data Warehouses

Online Expansion of Largescale Data Warehouses

Modern data warehouses store exceedingly large amounts of data, generally considered the crown jewels of an enterprise. The amount of data maintained in such data warehouses increases significantly over time, often at a continuous pace, e.g., by gathering additional data or retaining data for longer periods to derive additional business value, but occasionally also precipitously, e.g., when consolidating disparate data warehouses and data marts into a single database. Expanding a data warehouse that holds hundreds of terabytes of data by a substantial portion, e.g., 100% or more, is a complex and disruptive maintenance operation, as it typically involves some form of dumping and reloading of data, which requires substantial downtime. In this paper we describe the methodology and mechanisms we developed in Greenplum Database to expand large-scale data warehouses in an online fashion, i.e., without noticeable downtime. At the core of our approach is a set of robust and transactionally consistent primitives that enable efficient data movement. Special emphasis was put on usability and control, letting an administrator tailor the expansion process to specific operational characteristics via priorities and schedules. We present a number of experiments to quantify the impact of an ongoing expansion on query workloads.
