A Survey of Automatic Indexing Techniques for Thai Text Documents

With the rapidly growing number of Thai text documents available in digital media and on websites, an efficient text indexing technique is needed to facilitate search and retrieval. An efficient index shortens response time and improves the accessibility of documents. To date, Thai text indexing has received far less research attention than indexing for more widely studied languages such as English and other European languages. The main issue in Thai text indexing is the extraction of index terms: because Thai text is non-segmented, that is, written without explicit word boundaries, the terms cannot be identified directly from the documents. This makes indexing Thai text documents challenging. Most Thai text indexing techniques fall into two main categories, language-dependent and language-independent, both of which are described in this paper.
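To make the two categories concrete, the following minimal Python sketch contrasts one possible instance of each. It is an illustration under stated assumptions, not a method taken from this paper: the language-dependent side is represented by greedy longest-match segmentation against a tiny hypothetical lexicon, and the language-independent side by overlapping character n-grams. The function names, the two sample documents, and the lexicon are all invented for the example.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Language-independent terms: all overlapping character n-grams.

    No dictionary or word boundary is needed; the text is simply sliced.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_match_segment(text, lexicon):
    """Language-dependent terms: greedy longest-match against a lexicon.

    Assumption: `lexicon` is a small hypothetical word list. Characters
    that match no lexicon entry are emitted as single-character tokens.
    """
    tokens, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs, term_fn):
    """Build an inverted index mapping each term to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in term_fn(text):
            index[term].add(doc_id)
    return index


# Hypothetical toy data: "Thai language" and "Thai food", written unsegmented.
docs = {1: "ภาษาไทย", 2: "อาหารไทย"}
lexicon = {"ภาษา", "อาหาร", "ไทย"}

word_index = build_index(docs, lambda t: longest_match_segment(t, lexicon))
ngram_index = build_index(docs, char_ngrams)
```

In this toy setting both indexes map the term "ไทย" to both documents, but the word-based index depends on the lexicon covering the text, whereas the n-gram index needs no linguistic resources and instead produces many more terms per document.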
