Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Today’s rapidly growing document volumes pose pressing challenges to modern document analytics in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate both problems. The main idea is to perform document analytics directly on compressed data, without decompressing it first. We show how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and present a set of guidelines and assistant software modules that help developers apply compression-based direct processing effectively. Experiments show that our proposed techniques save 90.8% of storage space and 77.5% of memory usage, while speeding up data processing by 1.6X on sequential systems and 2.2X on distributed clusters, on average.

PVLDB Reference Format: Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen. Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights. PVLDB, 11(11): 1522-1535, 2018. DOI: https://doi.org/10.14778/3236187.3236203
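To make the idea of direct processing on a hierarchical grammar concrete, the following is a minimal sketch of a word count computed over a Sequitur-style grammar without expanding it. The grammar literal, the rule numbering, and the function names `word_counts` and `expand` are hypothetical and hand-written for illustration; they are not the paper's implementation or the output of a real Sequitur run.

```python
from collections import Counter

# Hypothetical hand-written grammar (an assumption, not real Sequitur output).
# Integers refer to other rules; strings are words. Rule 0 is the root.
# It encodes the text: "a b c a b d a b c a b d".
GRAMMAR = {
    0: [1, 1],
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}


def word_counts(grammar, root=0):
    """Count words by visiting each rule once and reusing its result."""
    memo = {}

    def count(rule_id):
        if rule_id in memo:
            return memo[rule_id]
        total = Counter()
        for symbol in grammar[rule_id]:
            if isinstance(symbol, int):
                # Sub-rule: add its already-computed (or newly computed) counts.
                total.update(count(symbol))
            else:
                # Terminal word.
                total[symbol] += 1
        memo[rule_id] = total
        return total

    return count(root)


def expand(grammar, root=0):
    """Fully decompress the grammar; used here only to check the result."""
    words = []
    for symbol in grammar[root]:
        if isinstance(symbol, int):
            words.extend(expand(grammar, symbol))
        else:
            words.append(symbol)
    return words


if __name__ == "__main__":
    print(word_counts(GRAMMAR))
```

Because each rule's counts are cached, a phrase that repeats many times in the text is counted once and then reused, which is the kind of saving that processing the compressed form directly is meant to expose.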
