Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases

A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters around these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is comparable to that of a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 percent (47.5 percent on average) better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
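The abstract does not spell out how the CC concept yields a cluster count, so the following is only a minimal sketch under stated assumptions. It assumes the definition commonly given in the CC literature: for a document-by-term matrix D, the cover coefficient c_ij is the reciprocal of row sum i multiplied by the sum over terms k of d_ik * d_jk divided by column sum k, and the estimated number of clusters is the sum of the diagonal entries c_ii. The function names and the toy matrix are hypothetical and are not taken from the paper.

```python
def cover_coefficients(D):
    """Return the m x m cover coefficient matrix for a document-by-term matrix D.

    D is a list of m rows (documents), each holding n nonnegative term weights.
    Assumed definition: c_ij = (1 / row_sum_i) * sum_k(d_ik * d_jk / col_sum_k).
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("every document and every term needs a nonzero entry")
    return [
        [
            sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n)) / row_sums[i]
            for j in range(m)
        ]
        for i in range(m)
    ]


def estimate_cluster_count(D):
    """Sum the diagonal of the cover coefficient matrix (assumed estimator)."""
    C = cover_coefficients(D)
    return sum(C[i][i] for i in range(len(C)))


# Hypothetical toy collection: two pairs of documents with disjoint vocabularies.
toy_matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
estimated = estimate_cluster_count(toy_matrix)  # 2.0 for this toy matrix
```

In this toy case each document shares all of its terms with exactly one other document, so each diagonal entry is 0.5 and the estimate matches the two visible groups. The paper's seed selection and cluster formation steps are not sketched here because the abstract gives no detail on them.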
