Rhea: Automatic Filtering for Unstructured Cloud Storage

Unstructured storage and data processing using platforms such as MapReduce are increasingly popular for their simplicity, scalability, and flexibility. Using elastic cloud storage and computation makes them even more attractive. However cloud providers such as Amazon and Windows Azure separate their storage and compute resources even within the same data center. Transferring data from storage to compute thus uses core data center network bandwidth, which is scarce and oversubscribed. As the data is unstructured, the infrastructure cannot automatically apply selection, projection, or other filtering predicates at the storage layer. The problem is even worse if customers want to use compute resources on one provider but use data stored with other provider(s). The bottleneck is now the WAN link which impacts performance but also incurs egress bandwidth charges. This paper presents Rhea, a system to automatically generate and run storage-side data filters for unstructured and semi-structured data. It uses static analysis of application code to generate filters that are safe, stateless, side effect free, best effort, and transparent to both storage and compute layers. Filters never remove data that is used by the computation. Our evaluation shows that Rhea filters achieve a reduction in data transfer of 2x- 20,000x, which reduces job run times by up to 5x and dollar costs for cross-cloud computations by up to 13x.

[1]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[2]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[3]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[4]  Wei Lin,et al.  Microsoft Bing Peking University , 2022 .

[5]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[6]  ReedBenjamin,et al.  Building a high-level dataflow system on top of Map-Reduce , 2009, VLDB 2009.

[7]  Andreas Podelski,et al.  Termination proofs for systems code , 2006, PLDI '06.

[8]  Christos Faloutsos,et al.  Active Disks for Large-Scale Data Processing , 2001, Computer.

[9]  Simson L. Garfinkel,et al.  An Evaluation of Amazon's Grid Computing Services: EC2, S3, and SQS , 2007 .

[10]  Flemming Nielson,et al.  Principles of Program Analysis , 1999, Springer Berlin Heidelberg.

[11]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[12]  William R. Cook,et al.  Extracting queries by static analysis of transparent persistence , 2007, POPL '07.

[13]  William R. Cook,et al.  Interprocedural query extraction for transparent persistence , 2008, OOPSLA.

[14]  Albert G. Greenberg,et al.  VL2: a scalable and flexible data center network , 2009, SIGCOMM '09.

[15]  David Pichardie,et al.  Sawja: Static Analysis Workshop for Java , 2010, FoVeOOS.

[16]  Mahadev Satyanarayanan,et al.  Diamond: A Storage Architecture for Early Discard in Interactive Search , 2004, FAST.

[17]  Rupak Majumdar,et al.  Path slicing , 2005, PLDI '05.

[18]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[19]  Ju Wang,et al.  Windows Azure Storage: a highly available cloud storage service with strong consistency , 2011, SOSP.

[20]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[21]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[22]  Willy Zwaenepoel,et al.  HadoopToSQL: a mapReduce query optimizer , 2010, EuroSys '10.

[23]  Gurmeet Singh Manku,et al.  RadixZip: Linear-Time Compression of Token Streams , 2007, VLDB.

[24]  Antoine Miné,et al.  The octagon abstract domain , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[25]  Sumit Gulwani,et al.  SPEED: precise and efficient static estimation of program computational complexity , 2009, POPL '09.

[26]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[27]  Andrew V. Goldberg,et al.  Quincy: fair scheduling for distributed computing clusters , 2009, SOSP '09.

[28]  Astrid Rheinländer,et al.  Opening the Black Boxes in Data Flow Optimization , 2012, Proc. VLDB Endow..

[29]  Valter Crescenzi,et al.  Automatic information extraction from large websites , 2004, JACM.

[30]  David W. Binkley,et al.  Interprocedural slicing using dependence graphs , 1990, TOPL.

[31]  Alvin Cheung,et al.  Automatic Partitioning of Database Applications , 2012, Proc. VLDB Endow..

[32]  Christopher Ré,et al.  Automatic Optimization for MapReduce Programs , 2011, Proc. VLDB Endow..