Cybertron: pushing the limit on I/O reduction in data-parallel programs

I/O reduction has been a major focus in optimizing data-parallel programs for big-data processing. While current state-of-the-art techniques use static program analysis to reduce I/O, Cybertron proposes a new direction that incorporates runtime mechanisms to push the limit of I/O reduction further. In particular, Cybertron accurately tracks at runtime how data is used in the computation. This lets it dynamically filter unused data at a finer granularity than current static-analysis-based mechanisms can achieve, and it enables a new mechanism, constraint-based encoding, for more efficient encoding. Cybertron has been implemented and applied to production data-parallel programs; our extensive evaluations on real programs and real data show that it reduces I/O beyond existing mechanisms at reasonable CPU cost and improves end-to-end performance in various network environments.
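As a rough illustration of the runtime-tracking idea in the abstract, the toy sketch below records which fields a downstream computation actually reads for each record and keeps only those fields. It is not Cybertron's implementation: the names `TrackedRecord`, `consumer`, and `filter_unused`, along with the sample records, are hypothetical and chosen only for this example. The point is that a column-level static analysis would have to keep `url` for every record, whereas per-record tracking can drop it wherever the computation never touches it.

```python
class TrackedRecord:
    """Wraps one record and remembers which fields were actually read."""

    def __init__(self, fields):
        self._fields = fields
        self.used = set()

    def __getitem__(self, name):
        self.used.add(name)
        return self._fields[name]


def consumer(rec):
    # Hypothetical downstream computation: 'url' only matters for errors,
    # and 'agent' is never read at all.
    if rec["status"] >= 400:
        return (rec["url"], 1)
    return None


def filter_unused(records, compute):
    """Run `compute` on each record and keep only the fields it read."""
    slimmed = []
    for fields in records:
        tracked = TrackedRecord(fields)
        compute(tracked)
        slimmed.append({k: v for k, v in fields.items() if k in tracked.used})
    return slimmed


records = [
    {"status": 200, "url": "/index", "agent": "a"},
    {"status": 404, "url": "/missing", "agent": "b"},
]
slim = filter_unused(records, consumer)
```

In this sketch the first record shrinks to its `status` field alone, while the second keeps both `status` and `url`; running `consumer` on the slimmed records gives the same results as on the originals.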

Wenguang Chen | Wei Lin | Xi Wang | Tian Xiao | Jiaxing Zhang | Xu Zhao | Zhenyu Guo | Lidong Zhou | Chencheng Ye | Hucheng Zhou
