Pattern Aware Cache Management for Efficient Subset Retrieving of Astronomical Image Data

FITS (Flexible Image Transport System) is the most widely used data format in astronomy. A single FITS file ranges in size from megabytes (MB) to gigabytes (GB) or even terabytes (TB), and astronomers were among the first researchers to encounter Big Data. In most cases, astronomers are interested only in a small sub-area of a time-series image. However, loading the whole raw FITS file from a hard disk drive (HDD) on every request and then cropping it to the target sub-area wastes both time and I/O, and no existing cache scheme is optimized for subset retrieval from FITS files. We propose a Pattern-Aware (PA) cache management strategy that retrieves sub-image data efficiently from large collections of FITS files by recognizing hot sub-areas from the most recent query history and by loading and merging related sub-images through a coordinate-mapping algorithm. We compared our method with the traditional LRU, LFU, and LRFU strategies, applied to full FITS files and to sub-files respectively. The results show that the PA strategy maintains a high hit ratio of 64.32% and reduces the average response time by about 24% relative to the best of these traditional schemes. These results are achieved with a cache size equal to 8.77% of the raw requested data size.
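The general idea of serving sub-image requests from a cached hot sub-area can be illustrated with a minimal Python sketch. This is not the authors' PA algorithm: the class `HotRegionCache`, the helper functions, the `load_region` callback, and the choice of a bounding box over recent queries as the "hot" region are all hypothetical simplifications introduced here for illustration. Rectangles are `(x0, y0, x1, y1)` tuples in full-image pixel coordinates, with the upper bounds exclusive.

```python
from collections import deque


def bounding_box(rects):
    """Smallest rectangle covering every rectangle in rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def crop(pixels, origin, rect):
    """Map full-image coordinates to offsets inside a cached block, then slice."""
    ox, oy = origin
    x0, y0, x1, y1 = rect
    return [row[x0 - ox:x1 - ox] for row in pixels[y0 - oy:y1 - oy]]


class HotRegionCache:
    """Keeps one cached block covering the most recent queries (assumed policy)."""

    def __init__(self, load_region, history_len=4):
        self.load_region = load_region      # hypothetical loader: rect -> 2D list
        self.history = deque(maxlen=history_len)
        self.hot_rect = None
        self.hot_pixels = None
        self.hits = 0
        self.misses = 0

    def get(self, rect):
        self.history.append(rect)
        if self.hot_rect is not None and contains(self.hot_rect, rect):
            self.hits += 1
        else:
            self.misses += 1
            # On a miss, re-derive the hot region from recent history and load it once.
            self.hot_rect = bounding_box(self.history)
            self.hot_pixels = self.load_region(self.hot_rect)
        return crop(self.hot_pixels, self.hot_rect[:2], rect)


def make_loader(width, height):
    """Stand-in for disk I/O: a synthetic image whose pixel value is y * 100 + x."""
    image = [[y * 100 + x for x in range(width)] for y in range(height)]

    def load(rect):
        x0, y0, x1, y1 = rect
        return [row[x0:x1] for row in image[y0:y1]]

    return load
```

In this sketch, two overlapping requests such as `(2, 2, 6, 6)` and `(4, 4, 8, 8)` both miss, after which the cached block covers `(2, 2, 8, 8)` and a later request inside that area is answered without calling the loader again. A real implementation would also need an eviction policy and a bound on the cached block size, which the sketch omits.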
