An update-aware storage system for low-locality update-intensive workloads

Traditional storage systems provide a simple read/write interface, which is inadequate for low-locality update-intensive workloads because it limits disk scheduling flexibility and makes inefficient use of buffer memory and raw disk bandwidth. This paper describes an update-aware disk access interface that allows applications to explicitly specify disk update requests and to attach to each request a call-back function that is invoked when the requested disk block is brought into memory. Because call-back functions offer a continuation mechanism after the requested blocks are retrieved, storage systems supporting this interface have more flexibility in scheduling pending disk update requests. In particular, this interface enables a simple but effective technique called Batching mOdifications with Sequential Commit (BOSC), which greatly improves the sustained throughput of a storage system under low-locality update-intensive workloads. In addition, when combined with a space-efficient low-latency disk logging technique, BOSC delivers the same durability guarantee as synchronous disk updates. Empirical measurements show that the random update throughput of a BOSC-based B+ tree is more than an order of magnitude higher than that of the same B+ tree implementation on a traditional storage system.
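
The idea of queuing updates with call-backs and committing them in one ordered pass can be illustrated with a minimal sketch. It is not the paper's implementation: the class `UpdateAwareStore`, its methods `update` and `commit`, and the counter `block_reads` are hypothetical names, the "disk" is a Python list, and the logging that provides durability is left out.

```python
from collections import defaultdict


class UpdateAwareStore:
    """Toy in-memory stand-in for a disk; all names here are hypothetical."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks      # simulated on-disk blocks
        self.pending = defaultdict(list)    # block id -> queued call-backs
        self.block_reads = 0                # counts simulated block fetches

    def update(self, block_id, callback):
        """Queue an update; callback(old_value) returns the new block value."""
        self.pending[block_id].append(callback)

    def commit(self):
        """Apply queued updates in one pass over blocks in ascending order."""
        for block_id in sorted(self.pending):
            value = self.blocks[block_id]   # one fetch per block, not per update
            self.block_reads += 1
            for callback in self.pending[block_id]:
                value = callback(value)     # call-back runs once block is in memory
            self.blocks[block_id] = value
        self.pending.clear()


store = UpdateAwareStore(num_blocks=8)
for block_id in [5, 1, 5, 3, 1, 5]:         # low-locality update stream
    store.update(block_id, lambda v: v + 1)
store.commit()
print(store.blocks, store.block_reads)
```

In this sketch the six queued updates touch three distinct blocks, so the commit pass fetches each of those blocks once and visits them in ascending order, which is the property that lets a real system replace random accesses with a sequential sweep.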
