TADOC: Text analytics directly on compression

This article provides a comprehensive description of Text Analytics Directly on Compression (TADOC), a technique that performs document analytics directly on compressed textual data. The article explains the concept of TADOC and the challenges to realizing it effectively. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
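To make the idea of analytics on hierarchically compressed text concrete, the following is a minimal sketch of a word count computed over a small grammar, where each rule is counted once and its result is reused wherever the rule is referenced. The grammar `RULES`, the rule numbering, and the function names `word_counts` and `expand` are all hypothetical and chosen for illustration; this is not the article's implementation, only a toy instance of counting without expanding the text.

```python
from collections import Counter

# Hypothetical toy grammar standing in for hierarchically compressed text.
# Keys are rule ids; integers inside a rule reference other rules,
# strings are words. Rule 0 is assumed to be the root (the whole document).
RULES = {
    0: [1, "and", 1, "again", 2],
    1: [2, "on", "compression"],
    2: ["text", "analytics"],
}


def word_counts(rules, root=0):
    """Count words by visiting each rule once, never building the full text."""
    memo = {}

    def count(rule_id):
        # Each rule's counts are computed once and reused on later references.
        if rule_id not in memo:
            total = Counter()
            for symbol in rules[rule_id]:
                if isinstance(symbol, int):
                    total.update(count(symbol))  # add the sub-rule's counts
                else:
                    total[symbol] += 1
            memo[rule_id] = total
        return memo[rule_id]

    return count(root)


def expand(rules, root=0):
    """Fully decompress the grammar; used here only to check the result."""
    words = []
    for symbol in rules[root]:
        if isinstance(symbol, int):
            words.extend(expand(rules, symbol))
        else:
            words.append(symbol)
    return words


if __name__ == "__main__":
    print(word_counts(RULES))
```

In this sketch, rule 2 appears three times in the expanded text but is processed only once, which is the kind of reuse that makes working on the compressed form attractive. Comparing against `Counter(expand(RULES))` confirms that both routes give the same counts for this toy input.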
