Cybertron: pushing the limit on I/O reduction in data-parallel programs

I/O reduction has been a major focus in optimizing data-parallel programs for big-data processing. While the current state-of-the-art techniques use static program analysis to reduce I/O, Cybertron proposes a new direction that incorporates runtime mechanisms to push the limit further on I/O reduction. In particular, Cybertron tracks how data is used in the computation accurately at runtime to filter unused data at finer granularity dynamically, beyond what current static-analysis based mechanisms are capable of, and to facilitate a new mechanism called constraint based encoding for more efficient encoding. Cybertron has been implemented and applied to production data-parallel programs; our extensive evaluations on real programs and real data have shown its effectiveness on I/O reduction over the existing mechanisms at reasonable CPU cost, and its improvement on end-to-end performance in various network environments.

[1]  Antony I. T. Rowstron,et al.  Rhea: Automatic Filtering for Unstructured Cloud Storage , 2013, NSDI.

[2]  Wei Lin,et al.  Microsoft Bing Peking University , 2022 .

[3]  Peter Deutsch,et al.  DEFLATE Compressed Data Format Specification version 1.3 , 1996, RFC.

[4]  Hao Wang,et al.  Towards automatic generation of vulnerability-based signatures , 2006, 2006 IEEE Symposium on Security and Privacy (S&P'06).

[5]  Jingren Zhou,et al.  Incorporating partitioning and parallel plans into the SCOPE optimizer , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[6]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[7]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[8]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[9]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[10]  Ding Yuan,et al.  SherLog: error diagnosis by connecting clues from run-time logs , 2010, ASPLOS XV.

[11]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Christopher Ré,et al.  Automatic Optimization for MapReduce Programs , 2011, Proc. VLDB Endow..

[14]  Michael Isard,et al.  Distributed aggregation for data-parallel computing: interfaces and implementations , 2009, SOSP '09.

[15]  Craig Chambers,et al.  FlumeJava: easy, efficient data-parallel pipelines , 2010, PLDI '10.

[16]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[17]  James C. King,et al.  Symbolic execution and program testing , 1976, CACM.

[18]  Yuanyuan Zhou,et al.  Triage: diagnosing production run failures at the user's site , 2007, SOSP.

[19]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[20]  Manuel Costa,et al.  Bouncer: securing software by blocking bad input , 2008, WRAITS '08.

[21]  Miguel Castro,et al.  Better bug reporting with better privacy , 2008, ASPLOS 2008.

[22]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[23]  Nicolas Bruno,et al.  SCOPE: parallel databases meet MapReduce , 2012, The VLDB Journal.

[24]  Nikolaj Bjørner,et al.  Z3: An Efficient SMT Solver , 2008, TACAS.

[25]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[26]  Christopher Olston,et al.  Automatic Optimization of Parallel Dataflow Programs , 2008, USENIX Annual Technical Conference.

[27]  Jiaxing Zhang,et al.  Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE , 2012, OSDI.

[28]  Dawson R. Engler,et al.  KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs , 2008, OSDI.