Accelerating Big Data Applications on Tiered Storage System with Various Eviction Policies

Using new types of devices, such as SSDs, to improve the I/O performance of hybrid storage has become a recent trend. Many efforts have been made to apply these devices to hybrid storage in distributed environments, but most of them are confined to specific file systems, such as HDFS. Moreover, the low performance of HDFS degrades the performance of hybrid storage built on it. In this paper, we improve the performance of tiered storage systems (one kind of hybrid storage system) in distributed environments with a pluggable eviction framework, considering that the data on each node is accessed regularly. On top of the eviction framework, we provide several eviction policies, including LRU, LRFU, LIRS and ARC, which cover different access patterns to accelerate upper-layer big data applications. Moreover, our design is general to all tiered storage systems. We evaluate the eviction framework with three widely used big data applications and find that LIRS improves the hit ratio by 30% over most other policies when running KMeans and PageRank, that ARC improves the hit ratio by up to 30% over other policies when running complicated SQL applications, and that LRFU consistently achieves relatively good performance when its configuration properties are set within a reasonable range. We have implemented our prototype on Alluxio, a widely used memory-centric distributed storage system. In addition, the eviction policies we contributed have been merged into Alluxio and are already in use.
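To make the idea of a pluggable eviction framework concrete, the following is a minimal Python sketch and not the implementation described in the paper or Alluxio's API. The names `Evictor`, `LRUEvictor`, `LRFUEvictor`, `Tier` and the parameter `lam` are hypothetical, chosen only for illustration. The sketch assumes a single tier that holds a fixed number of equally sized blocks, and it shows only two of the four policies. The LRFU variant follows the combined recency and frequency (CRF) idea from the LRFU paper, weighting an access of age x by (1/2)^(lam * x), so that `lam = 0` behaves like LFU and larger values lean toward LRU.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class Evictor(ABC):
    """Hypothetical plug-in interface: the tier reports events and asks for a victim."""

    @abstractmethod
    def on_access(self, block_id): ...

    @abstractmethod
    def on_remove(self, block_id): ...

    @abstractmethod
    def pick_victim(self): ...


class LRUEvictor(Evictor):
    def __init__(self):
        self._order = OrderedDict()  # oldest access first

    def on_access(self, block_id):
        self._order[block_id] = None
        self._order.move_to_end(block_id)  # mark as most recently used

    def on_remove(self, block_id):
        self._order.pop(block_id, None)

    def pick_victim(self):
        return next(iter(self._order))  # least recently used block


class LRFUEvictor(Evictor):
    def __init__(self, lam=0.1):
        self._lam = lam      # assumed tuning knob: 0 acts like LFU
        self._clock = 0      # logical time, one tick per access
        self._crf = {}       # block_id -> (CRF value, time of last access)

    def _decayed(self, block_id):
        value, last = self._crf[block_id]
        return value * 0.5 ** (self._lam * (self._clock - last))

    def on_access(self, block_id):
        self._clock += 1
        old = self._decayed(block_id) if block_id in self._crf else 0.0
        self._crf[block_id] = (old + 1.0, self._clock)

    def on_remove(self, block_id):
        self._crf.pop(block_id, None)

    def pick_victim(self):
        return min(self._crf, key=self._decayed)  # lowest decayed CRF


class Tier:
    """A single storage tier with room for `capacity` blocks."""

    def __init__(self, capacity, evictor):
        self.capacity = capacity
        self.evictor = evictor
        self.blocks = set()
        self.hits = 0
        self.accesses = 0

    def access(self, block_id):
        self.accesses += 1
        if block_id in self.blocks:
            self.hits += 1
        else:
            if len(self.blocks) >= self.capacity:
                victim = self.evictor.pick_victim()
                self.blocks.remove(victim)
                self.evictor.on_remove(victim)
            self.blocks.add(block_id)
        self.evictor.on_access(block_id)

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0


def replay(trace, capacity, evictor):
    """Replay a list of block ids against one tier and return the tier."""
    tier = Tier(capacity, evictor)
    for block_id in trace:
        tier.access(block_id)
    return tier
```

Because the tier talks to the policy only through the three `Evictor` methods, swapping policies is a one-argument change, for example `replay(trace, 2, LRUEvictor())` versus `replay(trace, 2, LRFUEvictor(lam=0.0))`. On a trace where one block is reused heavily between one-off accesses, the frequency-aware policy can keep the hot block resident where plain LRU evicts it, which is the kind of access-pattern sensitivity the abstract refers to.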
