Fault-tolerant precise data access on distributed log-structured merge-tree

The log-structured merge-tree (LSM-tree) has been adopted by many distributed storage systems. It decomposes a large database into multiple parts: one in-writing part and several read-only ones. Records are first written into a memory-optimized structure and then periodically compacted into on-disk structures. This design achieves high write throughput, but it has the side effect that a read request may have to go through multiple structures to find the required record. In a distributed database system, the parts of the LSM-tree are stored on different nodes, so a server in the query layer has to issue multiple network requests to pull data items from the underlying storage layer. To address this problem, this work proposes a precise data access strategy with two components: (1) an efficient structure with low maintenance overhead that tests whether a record exists in the in-writing part of the LSM-tree, and (2) a lease-based synchronization strategy that keeps the copies of this structure on remote query servers consistent. We further prove that the technique works robustly while the LSM-tree is reorganizing multiple structures in the backend. It is also fault-tolerant: the structures used in data access can be recovered after node failures. Experiments using the YCSB benchmark show that the solution achieves a 6x throughput improvement over existing methods.
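The read path described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the abstract does not specify the existence structure, so a Bloom-style bit array is assumed here, and every name (`ExistenceFilter`, `StorageNode`, `QueryServer`, `grant_lease`) is hypothetical. Network calls are simulated by direct method calls, time is passed in explicitly as `now`, and compaction and failure recovery are not modeled.

```python
import hashlib


class ExistenceFilter:
    """Assumed Bloom-style bit array: no false negatives, rare false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def copy(self):
        clone = ExistenceFilter(self.num_bits, self.num_hashes)
        clone.bits = bytearray(self.bits)
        return clone


class StorageNode:
    """Holds the in-writing part, the read-only part, and the master filter."""

    def __init__(self):
        self.in_writing = {}   # memory-optimized, still accepting writes
        self.read_only = {}    # stands in for the compacted on-disk parts
        self.filter = ExistenceFilter()
        self.leases = {}       # query server -> lease expiry time

    def grant_lease(self, server, now, duration):
        self.leases[server] = now + duration
        return self.filter.copy(), now + duration

    def write(self, key, value, now):
        self.filter.add(key)
        # Assumption: the update reaches every holder of a valid lease
        # before the write becomes visible; expired holders are skipped.
        for server, expiry in self.leases.items():
            if now < expiry:
                server.filter.add(key)
        self.in_writing[key] = value


class QueryServer:
    """Keeps a leased copy of the filter to skip needless remote lookups."""

    def __init__(self, node):
        self.node = node
        self.filter = None
        self.lease_expiry = float("-inf")
        self.in_writing_requests = 0  # counts simulated remote lookups

    def read(self, key, now, lease_duration=10.0):
        if now >= self.lease_expiry:  # lease expired: fetch a fresh copy
            self.filter, self.lease_expiry = self.node.grant_lease(
                self, now, lease_duration)
        if self.filter.might_contain(key):
            self.in_writing_requests += 1
            if key in self.node.in_writing:
                return self.node.in_writing[key]
        return self.node.read_only.get(key)
```

In this sketch a query server contacts the in-writing part only when its leased filter reports that the key may be present. A copy whose lease has expired is never trusted, which is what allows the storage node to stop pushing updates to unresponsive holders.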
