Paper information - Global Analytics in the Face of Bandwidth and Regulatory Constraints

Global Analytics in the Face of Bandwidth and Regulatory Constraints

Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting, such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including joins, across global data.
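
One of the mechanisms named in the abstract, deltas on cached subquery results, can be illustrated with a minimal sketch. This is not Geode's implementation. It assumes, purely for illustration, that a subquery result is a mapping from group key to aggregate value, that both data centers hold a copy of the last result transferred, and that only changed, new, or removed rows need to cross the wide-area link. The names `compute_delta` and `apply_delta` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical shape of a subquery result: group key -> aggregate value.
Result = Dict[str, int]


def compute_delta(cached: Result, fresh: Result) -> Tuple[Result, List[str]]:
    """Sender side: compare the fresh result with the last one sent.

    Returns the rows that are new or changed, plus the keys that vanished.
    """
    upserts = {k: v for k, v in fresh.items()
               if k not in cached or cached[k] != v}
    removed = [k for k in cached if k not in fresh]
    return upserts, removed


def apply_delta(cached: Result, upserts: Result, removed: List[str]) -> Result:
    """Receiver side: rebuild the fresh result from the cached copy and the delta."""
    gone = set(removed)
    result = {k: v for k, v in cached.items() if k not in gone}
    result.update(upserts)
    return result


# Illustrative data only: per-region counts from two runs of the same subquery.
previous_run = {"us": 10, "eu": 7, "asia": 3}
current_run = {"us": 10, "eu": 9, "latam": 1}

upserts, removed = compute_delta(previous_run, current_run)
rebuilt = apply_delta(previous_run, upserts, removed)
print(upserts, removed, rebuilt == current_run)
```

In this toy case two rows and one removed key are transferred instead of the full three-row result. Any real saving depends on how much of the result changes between runs.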
