FvRS: Efficiently identifying performance-critical data for improving performance of big data processing

Abstract Hybrid storage is widely used in big data processing because it provides large storage capacity and high access speed in an economical manner. Performance-critical data are usually stored on SSD to obtain the most performance benefit at the least storage cost. The conventional scheme identifies performance-critical data based on access hotness alone; because it ignores the data's I/O cost, it may place low-cost data on SSD and waste SSD space. A recently proposed scheme determines performance-critical data based on both access hotness and I/O cost, but it fails to evaluate I/O cost accurately and therefore still places much low-cost data on SSD. In this paper, we propose FvRS, a sequentiality-aware identification scheme for performance-critical data that boosts the accuracy of I/O cost evaluation by exploiting access sequentiality. The key idea is to evaluate data's I/O cost based on both request size and access sequentiality. By correctly identifying high-cost hot data, FvRS maximizes SSD utilization and improves system performance. In addition, FvRS maintains performance-critical data in a real-time table to reduce identification overhead. We have implemented FvRS in a hybrid storage system on Linux. Extensive evaluations using three real-workload traces and the well-known Postmark benchmark demonstrate the accuracy and efficiency of FvRS. Compared with state-of-the-art schemes, namely hotness-based and cost-based identification, FvRS reduces I/O response time by 10.3%∼45.6% and 16.3%∼25.1%, respectively.
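
The abstract describes combining access hotness with a sequentiality-aware I/O cost estimate and keeping the winners in a real-time candidate table. The following is a minimal illustrative sketch, not the paper's actual implementation: the per-extent statistics, the cost model constants, and all function names are assumptions introduced here only to make the idea concrete.

```python
from dataclasses import dataclass
import heapq

@dataclass
class ExtentStats:
    """Hypothetical per-extent access statistics (names are illustrative)."""
    extent_id: int
    access_count: int = 0      # hotness: how many requests touched the extent
    total_bytes: int = 0       # sum of request sizes, for average request size
    sequential_hits: int = 0   # requests contiguous with the previous request

    def avg_request_size(self) -> float:
        return self.total_bytes / self.access_count if self.access_count else 0.0

    def sequentiality(self) -> float:
        return self.sequential_hits / self.access_count if self.access_count else 0.0


def io_cost_ms(stats: ExtentStats, hdd_seek_ms: float = 8.0,
               hdd_bw_mb_s: float = 150.0) -> float:
    """Estimate per-access HDD cost from request size and sequentiality.
    Sequential accesses amortize the seek, so a highly sequential extent is
    already cheap on HDD and gains little from SSD placement.
    (Assumed constants: 8 ms average seek, 150 MB/s sequential bandwidth.)"""
    transfer_ms = stats.avg_request_size() / (hdd_bw_mb_s * 1024 * 1024) * 1000.0
    effective_seek_ms = hdd_seek_ms * (1.0 - stats.sequentiality())
    return effective_seek_ms + transfer_ms


def ssd_benefit(stats: ExtentStats) -> float:
    """Rank extents by hotness x per-access HDD cost: hot extents dominated by
    small, random requests benefit most from being kept on SSD."""
    return stats.access_count * io_cost_ms(stats)


def top_candidates(all_stats, ssd_capacity_extents: int):
    """Real-time candidate table: the top-N extents by estimated SSD benefit."""
    return heapq.nlargest(ssd_capacity_extents, all_stats, key=ssd_benefit)
```

Under this sketch, a hot but fully sequential extent scores low because its seek term vanishes, so it stays on HDD, while a hot extent served by small random requests scores high and is promoted to SSD; this is the intuition behind penalizing hotness-only and size-only cost estimates.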
