Paper information - Large scale document inversion using a multi-threaded computing system

Microprocessor architecture is moving toward multi-core and multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general-purpose computing. Because the GPU consists of multiple cores, it can serve as a massively parallel co-processor. It is also an affordable, user-programmable commodity. Meanwhile, a vast amount of information is flowing into the digital domain worldwide. Huge volumes of data, such as digital libraries, social networking services, and e-commerce product data and reviews, are produced or collected every moment and grow dramatically in size. The inverted index is a useful data structure for full-text search and document retrieval, but building one over a large number of documents takes a tremendous amount of time. Multi-threaded, multi-core GPUs can improve the performance of document inversion. Our approach implements a linear-time, hash-based, single program multiple data (SPMD) document inversion algorithm on the NVIDIA GPU/CUDA programming platform, using the computational power of the GPU to develop high-performance solutions for document indexing. Our proposed parallel document inversion system runs 2 to 3 times faster than a sequential system on two test datasets, drawn from PubMed abstracts and e-commerce product reviews.
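The idea of hash-based, SPMD-style document inversion can be illustrated with a minimal sketch. This is not the authors' CUDA implementation: it is a sequential Python illustration under simplifying assumptions, in which a dictionary stands in for the hash table, the same routine is applied to each partition of the documents, and the partial indexes are merged afterwards. The names `tokenize`, `invert_partition`, `invert`, and `num_workers` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and yield maximal alphanumeric runs (a simplification)."""
    token = []
    for ch in text.lower():
        if ch.isalnum():
            token.append(ch)
        elif token:
            yield "".join(token)
            token = []
    if token:
        yield "".join(token)


def invert_partition(docs):
    """Build a partial inverted index for one partition of (doc_id, text) pairs."""
    index = defaultdict(list)  # hash table: term -> list of document ids
    for doc_id, text in docs:
        # set() ensures each document appears at most once per term
        for term in set(tokenize(text)):
            index[term].append(doc_id)
    return index


def invert(docs, num_workers=2):
    """Apply the same routine to every partition, then merge the partial indexes.

    The partitions are processed one after another here; on a GPU each would
    be handled by its own group of threads (assumption for illustration).
    """
    docs = list(docs)
    partitions = [docs[i::num_workers] for i in range(num_workers)]
    merged = defaultdict(list)
    for partial in map(invert_partition, partitions):
        for term, postings in partial.items():
            merged[term].extend(postings)
    # Sorting is only for readable, deterministic output in this sketch.
    return {term: sorted(postings) for term, postings in merged.items()}


docs = [
    (0, "GPU computing with CUDA"),
    (1, "Document inversion on the GPU"),
    (2, "CUDA threads"),
]
index = invert(docs)
print(index["gpu"])
print(index["cuda"])
```

Each term maps to the list of documents that contain it, which is what a full-text search then looks up. Hash-table insertion keeps the per-token cost roughly constant, so building the partial indexes takes time roughly linear in the total number of tokens.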
