Enabling Efficient Random Access to Hierarchically-Compressed Data

Recent studies have shown the promise of direct data processing on hierarchically-compressed text documents. By removing the need for decompressing data, the direct data processing technique brings large savings in both time and space. However, its benefits have been limited to data traversal operations; for random accesses, direct data processing is several times slower than the state-of-the-art baselines. This paper presents a set of techniques that successfully eliminate the limitation, and for the first time, establishes the feasibility of effectively handling both data traversal operations and random data accesses on hierarchically-compressed data. The work yields a new library, which achieves 3.1 × speedup over the state-of-the-art on random data accesses to compressed data, while preserving the capability of supporting traversal operations efficiently and providing large (3.9 ×) space savings.

[1]  Trishul M. Chilimbi Efficient representations and abstractions for quantifying and exploiting data reference locality , 2001, PLDI '01.

[2]  Ion Stoica,et al.  BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores , 2016, NSDI.

[3]  J. Larus Whole program paths , 1999, PLDI '99.

[4]  Ian H. Witten,et al.  Linear-time, incremental hierarchy inference for compression , 1997, Proceedings DCC '97. Data Compression Conference.

[5]  Quanzhong Li,et al.  Supporting efficient query processing on compressed XML files , 2005, SAC '05.

[6]  Wenguang Chen,et al.  Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights , 2018, Proc. VLDB Endow..

[7]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[8]  Sheeva Afshan,et al.  Using compression algorithms to support the comprehension of program traces , 2010, WODA '10.

[9]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[10]  Gonzalo Navarro,et al.  Compact Data Structures - A Practical Approach , 2016 .

[11]  Ian H. Witten,et al.  Identifying Hierarchical Structure in Sequences: A linear-time algorithm , 1997, J. Artif. Intell. Res..

[12]  Edward A. Lee,et al.  The Cloud is Not Enough: Saving IoT from the Cloud , 2015, HotStorage.

[13]  I. Sim,et al.  Physicians' use of electronic medical records: barriers and solutions. , 2004, Health affairs.

[14]  Wenguang Chen,et al.  Zwift: A Programming Framework for High Performance Text Analytics on Compressed Data , 2018, ICS.

[15]  Roberto Grossi,et al.  High-order entropy-compressed text indexes , 2003, SODA '03.

[16]  Martin Hirzel,et al.  Dynamic hot data stream prefetching for general-purpose programs , 2002, PLDI '02.

[17]  Subhajit Roy,et al.  Identifying Hierarchical Structures in Sequences on GPU , 2015, 2015 IEEE Trustcom/BigDataSE/ISPA.

[18]  C. Lee Giles,et al.  Context and Page Analysis for Improved Web Search , 1998, IEEE Internet Comput..

[19]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[20]  Craig G. Nevill-Manning,et al.  Inferring Sequential Structure , 1996 .

[21]  Viju Raghupathi,et al.  Big data analytics in healthcare: promise and potential , 2014, Health Information Science and Systems.

[22]  Gregg Rothermel,et al.  Whole program path-based dynamic impact analysis , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[23]  Brad Calder,et al.  Motivation for Variable Length Intervals and Hierarchical Phase Behavior , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[24]  Ion Stoica,et al.  ZipG: A Memory-efficient Graph Store for Interactive Queries , 2017, SIGMOD Conference.

[25]  Ion Stoica,et al.  Succinct: Enabling Queries on Compressed Data , 2015, NSDI.

[26]  Stephan Vogel,et al.  Adaptive parallel sentences mining from web bilingual news collection , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[27]  P. Winn,et al.  Online Court Records: Balancing Judicial Accountability and Privacy in an Age of Electronic Information , 2004 .