Using N-gram and Frequent Max Substring Techniques for Index-Term Extraction from Non-Segmented Texts: A Comparison of Two Techniques

The amount of electronically stored information in non-segmented texts has grown rapidly, and the number of such documents is still increasing. This makes index-term extraction an essential task, and several techniques have been proposed for extracting index-terms from non-segmented texts in order to support indexing. In this paper, we investigate two index-term extraction techniques for non-segmented texts: the n-gram technique and the frequent max substring technique. Many research communities have acknowledged that the n-gram technique is one of the viable solutions for extracting index-terms from non-segmented texts, such as texts in the Chinese, Japanese, Korean, and Thai languages, and genome or protein sequences in the area of bioinformatics. In addition, the frequent max substring technique has been proposed as an alternative method for extracting index-terms. This technique provides significant benefits for indexing non-segmented texts. In this paper, experimental studies and comparison results are presented in order to compare the two techniques. From the experimental results, the following observations can be made. The n-gram technique requires less space to extract the index-terms when compared to the frequent max substring technique. Meanwhile, the frequent max substring technique improves on the n-gram technique in terms of performance, as it can be applied to many non-segmented texts without the need to determine the dimensions of the term.
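As an illustration of the n-gram side of this comparison, the following is a minimal sketch of character n-gram extraction from a non-segmented string. It is not the implementation evaluated in the paper: the function names `char_ngrams` and `ngram_index_terms` and the `min_count` threshold are assumptions introduced here for illustration only. The sketch shows why the term length `n` has to be fixed before extraction can start.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping character n-gram of `text`, in order.

    `n` is the fixed term length that must be chosen up front.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A string of length L yields L - n + 1 overlapping n-grams
    # (an empty list when the string is shorter than n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_index_terms(text, n, min_count=1):
    """Count n-grams and keep those occurring at least `min_count` times.

    `min_count` is a hypothetical filter, not something taken from the paper.
    """
    counts = Counter(char_ngrams(text, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    sample = "abcab"  # stands in for a text with no word delimiters
    print(char_ngrams(sample, 2))
    print(ngram_index_terms(sample, 2, min_count=2))
```

Changing `n` changes the whole candidate set, which is the dependency on a predetermined term dimension that the abstract contrasts with the frequent max substring technique.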
