A Task-Aware Fine-Grained Storage Selection Mechanism for In-Memory Big Data Computing Frameworks

In-memory big data computing, widely used in areas such as deep learning and artificial intelligence, can meet the demands of ultra-low-latency services and real-time data analysis. However, existing in-memory computing frameworks tend to use memory aggressively: memory space is quickly exhausted, leading to severe performance degradation or even task failure. At the same time, growing volumes of raw and intermediate data impose huge memory demands, further exacerbating the memory shortage. To relieve memory pressure, these frameworks offer several storage scheme options that determine where and how data is cached. However, their storage scheme selection mechanisms are simple and insufficient, typically requiring manual configuration by users. Moreover, such coarse-grained storage mechanisms cannot match the memory access pattern of each computing unit, which works on only part of the data. In this paper, we propose a novel task-aware fine-grained storage scheme auto-selection mechanism. It automatically determines the storage scheme for caching each data block, the smallest unit of computation. The caching decision considers future tasks, real-time resource utilization, and storage costs, including block creation, I/O, and serialization costs under each storage scheme. Experiments show that, compared with the default storage setting, our mechanism offers significant performance improvement, reaching up to 78% in memory-constrained environments.

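To make the cost-based selection idea concrete, the sketch below (not the authors' implementation) models a per-block choice among common in-memory storage schemes. All cost estimates, thresholds, and names (e.g. BlockStats, choose_scheme) are hypothetical; note that systems such as Spark expose storage levels only at RDD granularity through persist(), so this illustrates only how block creation, I/O, and serialization costs plus future-task information might be weighed.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Scheme(Enum):
    """Candidate storage schemes for a cached block (illustrative only)."""
    MEMORY_DESERIALIZED = auto()   # raw objects in memory: fastest reads, largest footprint
    MEMORY_SERIALIZED = auto()     # serialized bytes in memory: smaller, pays (de)serialization
    DISK = auto()                  # spill to disk: frees memory, pays I/O on every reuse
    NO_CACHE = auto()              # recompute from lineage when needed


@dataclass
class BlockStats:
    """Hypothetical per-block estimates a task-aware selector might collect."""
    raw_size_mb: float          # size of the deserialized block
    serialized_size_mb: float   # size after serialization
    creation_cost_s: float      # time to recompute the block from its lineage
    ser_cost_s: float           # time to serialize + deserialize once
    io_cost_s: float            # time to write + read the block from disk once
    future_refs: int            # how many upcoming tasks will read this block


def choose_scheme(b: BlockStats, free_mem_mb: float) -> Scheme:
    """Pick the scheme with the lowest estimated total cost over future accesses."""
    costs = {
        # Re-creating the block on every future access.
        Scheme.NO_CACHE: b.creation_cost_s * b.future_refs,
        # One-time creation; reads are free if the raw block fits in memory.
        Scheme.MEMORY_DESERIALIZED: (b.creation_cost_s if b.raw_size_mb <= free_mem_mb
                                     else float("inf")),
        # One-time creation; every access pays deserialization.
        Scheme.MEMORY_SERIALIZED: (b.creation_cost_s + b.ser_cost_s * b.future_refs
                                   if b.serialized_size_mb <= free_mem_mb
                                   else float("inf")),
        # One-time creation; every access pays disk I/O plus (de)serialization.
        Scheme.DISK: b.creation_cost_s + (b.io_cost_s + b.ser_cost_s) * b.future_refs,
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # A block reused by three future tasks, with only 64 MB of free memory left.
    block = BlockStats(raw_size_mb=128, serialized_size_mb=48,
                       creation_cost_s=9.0, ser_cost_s=0.4,
                       io_cost_s=1.2, future_refs=3)
    print(choose_scheme(block, free_mem_mb=64))  # -> Scheme.MEMORY_SERIALIZED
```

In this toy setting, serialized in-memory caching wins because it avoids repeated recomputation and disk I/O while still fitting in the remaining memory; the paper's mechanism additionally folds in real-time resource utilization rather than a static free-memory snapshot.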