Global Analytics in the Face of Bandwidth and Regulatory Constraints

Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales because transoceanic links are expensive, and it may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists of orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting, such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including joins, across global data.
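
The idea of sending deltas on cached subquery results can be illustrated with a small sketch. This is not Geode's implementation, which the abstract does not describe; it only assumes that a remote data center remembers the subquery result it last shipped and, on the next run, transfers just the rows that changed. The function names `compute_delta` and `apply_delta` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# A subquery result is modeled as a mapping: group key -> aggregate value.
Result = Dict[str, int]

_MISSING = object()  # sentinel so that a missing key never equals a real value


def compute_delta(cached: Result, fresh: Result) -> Tuple[Result, List[str]]:
    """Run at the remote site: compare the fresh result with the cached one.

    Returns the rows that are new or changed (upserts) and the keys
    that disappeared (deletes). Only these would cross the wide-area link.
    """
    upserts = {k: v for k, v in fresh.items() if cached.get(k, _MISSING) != v}
    deletes = [k for k in cached if k not in fresh]
    return upserts, deletes


def apply_delta(cached: Result, upserts: Result, deletes: List[str]) -> Result:
    """Run at the receiving site: rebuild the fresh result from cache + delta."""
    removed = set(deletes)
    merged = {k: v for k, v in cached.items() if k not in removed}
    merged.update(upserts)
    return merged


if __name__ == "__main__":
    # Hypothetical per-region counts from two consecutive runs of one subquery.
    previous = {"us": 10, "eu": 7, "asia": 3}
    current = {"us": 10, "eu": 9, "jp": 1}

    upserts, deletes = compute_delta(previous, current)
    rebuilt = apply_delta(previous, upserts, deletes)
    print(upserts, deletes, rebuilt == current)
```

In this toy case only two changed rows and one deleted key are transferred instead of the full result; the saving in a real system depends on how much of the result is stable between runs.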