G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression

Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics, and GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no prior work shows how to use GPUs to accelerate TADOC. We present G-TADOC, the first framework that provides GPU-based text analytics directly on compression, enabling efficient text analytics on GPUs without decompressing the input data.

G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which adaptively partitions heavily dependent workloads in a fine-grained manner. Second, thousands of GPU threads writing to the same result buffer leads to inconsistency, while directly using locks and atomic operations incurs large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle these difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy that preserves sequence information while maintaining high GPU parallelism.

Our experimental evaluations show that G-TADOC provides a 31.1× average speedup over state-of-the-art TADOC.
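
To make the memory-pool idea concrete, the following is a minimal CUDA sketch of one way such a lock-light result buffer could be organized: each thread block reserves a private chunk of a global pool with a single atomicAdd, and its threads then write to disjoint slots without further locks or atomics. This is an illustrative assumption about the general technique, not G-TADOC's actual API or data layout; the names (Pool, reserve_chunk, CHUNK_WORDS, emit_results) are hypothetical.

    // Sketch: a bump-pointer memory pool shared by all GPU threads.
    // Instead of locking a shared result buffer on every write, each thread
    // block reserves a private chunk with one atomicAdd, and its threads
    // then write lock-free into disjoint slots of that chunk.
    #include <cuda_runtime.h>

    constexpr int CHUNK_WORDS = 256;   // words reserved per block at a time

    struct Pool {
        unsigned int *data;    // backing storage for all results
        unsigned int *cursor;  // global bump pointer (offset into data)
    };

    __device__ unsigned int *reserve_chunk(Pool pool) {
        __shared__ unsigned int base;   // chunk start, shared by the block
        if (threadIdx.x == 0)
            base = atomicAdd(pool.cursor, CHUNK_WORDS);  // one atomic per block
        __syncthreads();
        return pool.data + base;
    }

    __global__ void emit_results(Pool pool) {
        // Real code would bound-check against the pool size; omitted here.
        unsigned int *chunk = reserve_chunk(pool);
        // Each thread writes to its own slot: no locks, no further atomics.
        if (threadIdx.x < CHUNK_WORDS)
            chunk[threadIdx.x] = blockIdx.x * blockDim.x + threadIdx.x;
    }

    int main() {
        Pool pool;
        size_t words = 1 << 20;
        cudaMalloc(&pool.data, words * sizeof(unsigned int));
        cudaMalloc(&pool.cursor, sizeof(unsigned int));
        cudaMemset(pool.cursor, 0, sizeof(unsigned int));

        emit_results<<<64, CHUNK_WORDS>>>(pool);
        cudaDeviceSynchronize();

        cudaFree(pool.data);
        cudaFree(pool.cursor);
        return 0;
    }

The design trades some unused space at the end of each chunk for the elimination of per-write synchronization, which is the kind of trade-off the abstract alludes to when it contrasts a memory pool with direct use of locks and atomic operations.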
