A New Approach to Clustering Records in Information Retrieval Systems

This work introduces a new approach to record clustering: a hybrid algorithm that clusters records based on threshold values and on the query patterns issued against a particular database. The Hamming distance of a file is used as a measure of space density. The objective of the algorithm is to minimize the Hamming distance of the file while giving greater weight to the most frequently asked queries. Simulation experiments show that restructuring a file yields a substantial reduction in response time. We study the space density properties of a file and how they affect retrieval time before and after clustering, as a means of predicting file performance and choosing appropriate parameters. Factors that affect clustering and response time, such as block size, threshold value, and the percentage of records satisfying a given set of queries, are also studied.
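The idea of a file's Hamming distance as a density measure can be illustrated with a small sketch. The abstract does not give the exact definition, so this example makes two assumptions: each record is a binary attribute vector, and the Hamming distance of a file is the sum of the Hamming distances between consecutive records. The function names and the nearest-neighbour reordering are hypothetical stand-ins for illustration only; they are not the hybrid algorithm described above, which also accounts for thresholds and query frequencies.

```python
from typing import List, Sequence

Record = Sequence[int]  # assumed representation: binary attribute vector


def hamming(a: Record, b: Record) -> int:
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return sum(x != y for x, y in zip(a, b))


def file_hamming_distance(records: List[Record]) -> int:
    """Assumed definition: sum of distances between consecutive records."""
    return sum(hamming(r, s) for r, s in zip(records, records[1:]))


def greedy_reorder(records: List[Record]) -> List[Record]:
    """Nearest-neighbour ordering, an illustrative heuristic only.

    Starts from the first record and repeatedly appends the remaining
    record closest (in Hamming distance) to the last one placed.
    """
    if not records:
        return []
    remaining = list(records)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        idx = min(range(len(remaining)),
                  key=lambda i: hamming(last, remaining[i]))
        ordered.append(remaining.pop(idx))
    return ordered


records = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
print(file_hamming_distance(records))                  # 11
print(file_hamming_distance(greedy_reorder(records)))  # 5
```

Under these assumptions, a lower total means that similar records sit next to each other, which is the property the clustering aims for so that records answering the same query tend to fall in the same blocks.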
