Large scale document inversion using a multi-threaded computing system

Microprocessor architecture is moving towards multi-core, multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general-purpose computing. Because the GPU consists of many cores, it can serve as a massively parallel co-processor. It is also an affordable, user-programmable commodity device. Meanwhile, the volume of digital data, including digital libraries, social networking services, e-commerce product data, and reviews, is growing rapidly, with new data produced or collected every moment. The inverted index is a useful data structure for full-text search and document retrieval, but building it for a large document collection takes a tremendous amount of time. A multi-threaded, multi-core GPU can improve the performance of document inversion. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD) document inversion algorithm on the NVIDIA GPU/CUDA programming platform, using the computational power of the GPU to develop high-performance solutions for document indexing. Our proposed parallel document inversion system runs 2-3 times faster than a sequential system on two test datasets: PubMed abstracts and e-commerce product reviews.
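To make the idea of hash-based document inversion concrete, the following is a minimal CPU-side sketch in Python. It is not the CUDA implementation described in the abstract: Python's built-in dict stands in for the hash table, the tokenizer is deliberately simple, and the function names (`tokenize`, `invert`, `invert_partitioned`) are hypothetical. The partitioned variant only imitates the SPMD pattern, with each simulated worker running the same code on its own slice of documents before the partial indexes are merged.

```python
from collections import defaultdict


def tokenize(text):
    """Yield lowercase alphanumeric tokens from text (a simplistic tokenizer)."""
    token = []
    for ch in text.lower():
        if ch.isalnum():
            token.append(ch)
        elif token:
            yield "".join(token)
            token = []
    if token:
        yield "".join(token)


def invert(docs, first_id=0):
    """Build term -> sorted list of document ids in one pass over the input.

    Each term costs one expected O(1) hash-table lookup, so the total work
    is expected linear in the number of tokens.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=first_id):
        for term in tokenize(text):
            postings = index[term]
            # Documents are visited in id order, so checking the tail is
            # enough to avoid duplicate ids within one posting list.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)


def invert_partitioned(docs, num_workers):
    """Imitate SPMD: every worker inverts one contiguous slice, then merge.

    The workers run sequentially here; on a GPU they would be threads or
    thread blocks running the same program concurrently (assumption).
    """
    n = len(docs)
    chunk = (n + num_workers - 1) // num_workers  # ceiling division
    merged = defaultdict(list)
    for w in range(num_workers):
        start = w * chunk
        partial = invert(docs[start:start + chunk], first_id=start)
        # Slices are contiguous and merged in worker order, so the
        # concatenated posting lists stay sorted by document id.
        for term, postings in partial.items():
            merged[term].extend(postings)
    return dict(merged)


if __name__ == "__main__":
    sample = [
        "GPU computing with CUDA",
        "CUDA hash tables",
        "Hash-based inverted index on the GPU",
    ]
    for term, postings in sorted(invert(sample).items()):
        print(term, postings)
```

Splitting the collection into contiguous slices keeps the merge step a plain concatenation; a strided assignment of documents to workers would require sorting or a multi-way merge of the posting lists afterwards.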