Tokenization and N-Gram for Indexing Indonesian Translation of the Quran

Tokenization is an important process that breaks text into smaller units such as words. The n-gram model is now widely used in computational linguistics to predict the next item in a contiguous sequence of $n$ items drawn from a given sample of text. This paper focuses on implementing tokenization and an n-gram model in RapidMiner to produce unigram and bigram words for indexing the Indonesian Translation of the Quran (ITQ). The study uses an ITQ data set consisting of 114 documents. The method comprises data extraction and text preprocessing, including tokenization, stemming, stopword removal, case transformation, and n-gram generation. The results show that the model produces 6794 and 60323 unigram and bigram token combinations for use in indexing the ITQ. The significant contribution of this study is an enhanced digital index of the ITQ.
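The preprocessing chain described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than RapidMiner, so it is a stand-in for the idea and not the study's actual workflow. The stopword list, the function names, and the two sample documents are hypothetical, and stemming is omitted to keep the example short.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# a real Indonesian stopword list would be far larger.
STOPWORDS = {"yang", "dan", "di", "lagi"}

def tokenize(text):
    """Case transformation plus tokenization: lowercase, then keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]

def ngrams(tokens, n):
    """Return contiguous n-token terms, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(documents):
    """Map each unigram and bigram term to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        tokens = remove_stopwords(tokenize(text))
        for term in ngrams(tokens, 1) + ngrams(tokens, 2):
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical sample documents (not the 114-document ITQ data set).
docs = {
    1: "Dengan nama Allah Yang Maha Pengasih lagi Maha Penyayang",
    2: "Segala puji bagi Allah, Tuhan semesta alam",
}
index = build_index(docs)
print(sorted(index["allah"]))
```

Because stopwords are removed before the bigrams are formed, a bigram such as `allah_maha` can span a position where a stopword originally stood. Whether that ordering is desirable depends on how the index will be queried.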
