A new full-text indexing model with low space overhead for Chinese text retrieval

Text retrieval systems require an index to allow efficient retrieval of documents, at the cost of some storage overhead. This paper proposes a novel full-text indexing model for Chinese text retrieval based on the concept of the adjacency matrix of a directed graph. With this indexing model, a retrieval system needs to keep only the indexing data, rather than both the indexing data and the original text data as traditional retrieval systems do. In addition, occurrences of an index term are identified by the labels of the so-called s-strings in which the term appears, rather than by its positions as in traditional indexing models. Consequently, the space cost of the system as a whole can be reduced drastically while retrieval efficiency remains satisfactory. Experiments on several real-world Chinese text collections are carried out to demonstrate the effectiveness and efficiency of this model. In addition to Chinese, the proposed indexing model is also effective and efficient for text retrieval in other Oriental languages, such as Japanese and Korean. It is especially useful for digital library applications where storage resources are very limited (e.g., e-books and CD-based text retrieval systems).
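To make the idea concrete, the following is a minimal sketch of an adjacency-based index, not the model as defined in the paper. The abstract does not define s-strings, so the sketch assumes (purely for illustration) that an s-string is a greedy run of text in which no character repeats. Characters act as graph vertices, each adjacent character pair is a directed edge, and each edge stores the labels of the s-strings in which it occurs. All function names here are hypothetical.

```python
from collections import defaultdict

def split_s_strings(text):
    """Cut text greedily into pieces in which no character repeats.
    ASSUMPTION: this splitting rule is illustrative, not the paper's definition."""
    pieces, current, seen = [], [], set()
    for ch in text:
        if ch in seen:                      # repeat found: close the current piece
            pieces.append("".join(current))
            current, seen = [], set()
        current.append(ch)
        seen.add(ch)
    if current:
        pieces.append("".join(current))
    return pieces

def build_index(text):
    """Sparse adjacency matrix: (a, b) -> labels of pieces where b follows a.
    Also returns the first character of every piece (needed to rebuild the text)."""
    edges, heads = defaultdict(set), []
    for label, piece in enumerate(split_s_strings(text)):
        heads.append(piece[0])
        for a, b in zip(piece, piece[1:]):
            edges[(a, b)].add(label)
    return edges, heads

def search(edges, term):
    """Labels of pieces that contain term entirely (term length >= 2).
    Occurrences that straddle a piece boundary are NOT found by this sketch."""
    if len(term) < 2:
        raise ValueError("this sketch only handles terms of two or more characters")
    pairs = list(zip(term, term[1:]))
    labels = set(edges.get(pairs[0], set()))
    for pair in pairs[1:]:
        labels &= edges.get(pair, set())
    return labels

def reconstruct(edges, heads):
    """Rebuild the original text from the index alone."""
    successor = defaultdict(dict)           # label -> {char: next char}
    for (a, b), labels in edges.items():
        for label in labels:
            successor[label][a] = b
    out = []
    for label, ch in enumerate(heads):
        out.append(ch)
        while ch in successor[label]:       # follow the unique path of this piece
            ch = successor[label][ch]
            out.append(ch)
    return "".join(out)

text = "信息检索系统需要索引信息检索"
edges, heads = build_index(text)
print(search(edges, "信息"))               # labels of the pieces containing the term
print(reconstruct(edges, heads) == text)    # the text is recoverable from the index
```

Because no character repeats inside a piece under the assumed rule, every vertex has at most one outgoing edge per label. That is why intersecting the label sets of a term's character pairs gives exact matches within a piece, and why the text can be rebuilt without storing it separately. A complete system would also have to handle terms that cross piece boundaries, which this sketch leaves out.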
