A novel approach for indexing Arabic documents through GPU computing

Unlike English search engines, Arabic search engines have not received their fair share of attention in recent research, despite the continuous growth in Arabic Internet users and data. To help bridge this gap, this paper presents a novel indexing algorithm customized for Arabic documents. Our algorithm exploits the characteristics of the Arabic language to improve indexing and lookup. Additionally, the algorithm uses the highly parallel architecture of the graphics processing unit (GPU) to speed up indexing. Finally, we discuss some of the synchronization challenges we faced and the techniques we used to overcome them. Preliminary tests of our GPU-accelerated Arabic indexer show promising speed-up factors.
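
To illustrate the kind of synchronization issue that arises when many workers update a shared index, the following is a minimal CPU-side sketch in Python. It is not the paper's algorithm and does not run on a GPU. It assumes naive whitespace tokenization with no Arabic-specific stemming, and the names `build_index`, `index_one`, and the sample `docs` are hypothetical. A single lock guards the shared posting sets so that concurrent workers cannot corrupt them.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(docs, workers=4):
    """Build an inverted index: term -> sorted list of document ids."""
    index = defaultdict(set)   # shared structure updated by all workers
    lock = threading.Lock()    # coarse-grained lock guarding `index`

    def index_one(item):
        doc_id, text = item
        # Assumption: whitespace tokenization, no stemming or normalization.
        terms = set(text.split())
        with lock:
            for term in terms:
                index[term].add(doc_id)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every document is processed
        # and any worker exception is raised here.
        list(pool.map(index_one, docs.items()))

    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, used only for illustration.
docs = {1: "كتاب جديد", 2: "كتاب قديم", 3: "قلم جديد"}
index = build_index(docs)
```

Looking up `index["كتاب"]` returns the ids of the documents containing that term. On a GPU, a single coarse lock like this one would serialize the threads, which is one reason synchronization needs careful design in a parallel indexer.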
