Heterogeneity-Aware Data Placement in Hybrid Clouds

In next-generation cloud computing clusters, the performance of data-intensive applications will be limited, among other factors, by disk data transfer rates. To mitigate this bottleneck, cloud systems that offer hierarchical storage architectures are becoming commonplace. The Hadoop Distributed File System (HDFS) offers a collection of storage policies that exploit different storage types such as RAM_DISK, SSD, HDD, and ARCHIVE. However, developing algorithms that leverage heterogeneous storage through efficient data placement has been challenging. This work presents an intelligent algorithm based on genetic programming that finds the optimal mapping of input datasets to storage types on a Hadoop file system.
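To make the idea of searching for a dataset-to-storage mapping concrete, the following is a minimal sketch of an evolutionary search in Python. It is a plain genetic algorithm with a capacity penalty, not the authors' method, and everything in it is an assumption for illustration: the tier capacities and relative read costs, the dataset sizes and read counts, the cost function, and the names TIERS, DATASETS, PENALTY, cost, and evolve. The HDD tier is labeled DISK here, which is the name HDFS uses for that storage type.

```python
import random

# Hypothetical tiers: (name, capacity in GB, relative read cost per GB).
# All numbers are illustrative assumptions, not measurements.
TIERS = [("RAM_DISK", 8, 1.0), ("SSD", 64, 4.0),
         ("DISK", 512, 20.0), ("ARCHIVE", 4096, 100.0)]

# Hypothetical input datasets: (size in GB, expected reads per day).
DATASETS = [(4, 50), (30, 20), (100, 5), (2, 80), (400, 1), (20, 10)]

PENALTY = 1e6  # assumed cost added per GB that exceeds a tier's capacity


def cost(mapping):
    """Total read cost of a mapping (one tier index per dataset),
    plus a penalty for every GB placed beyond a tier's capacity."""
    used = [0.0] * len(TIERS)
    total = 0.0
    for (size, reads), tier in zip(DATASETS, mapping):
        used[tier] += size
        total += size * reads * TIERS[tier][2]
    overflow = sum(max(0.0, u - TIERS[t][1]) for t, u in enumerate(used))
    return total + PENALTY * overflow


def evolve(generations=200, pop_size=40, mutation_rate=0.1, seed=0):
    """Search for a low-cost mapping with elitism, tournament selection,
    uniform crossover, and per-gene random mutation."""
    rng = random.Random(seed)
    n, k = len(DATASETS), len(TIERS)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        nxt = pop[:2]  # elitism: carry the two best mappings over unchanged
        while len(nxt) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = min(rng.sample(pop, 3), key=cost)
            p2 = min(rng.sample(pop, 3), key=cost)
            # Uniform crossover: each gene comes from either parent.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Mutation: occasionally reassign a dataset to a random tier.
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    return min(pop, key=cost)


if __name__ == "__main__":
    best = evolve()
    for (size, reads), tier in zip(DATASETS, best):
        print(f"{size:>4} GB, {reads:>2} reads/day -> {TIERS[tier][0]}")
    print("cost:", cost(best))
```

In a real deployment, a chosen tier would then be applied through an HDFS storage policy, for example with `hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>`. A penalty term is only one way to handle capacity constraints; repair operators or dominance-based selection are alternatives.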
