Incremental Clustering for Very Large Document Databases: Initial MARIAN Experience

Clustering of document databases is useful for both browsing and searching; however, it can be prohibitively expensive computationally for large collections. The problem is compounded when the clustering structure must reflect a constantly changing database. Efficient algorithms that maintain an existing clustering structure are therefore desirable. This study provides the details of a large-scale implementation of the Cover-Coefficient-based Incremental Clustering Methodology (C2ICM). Experiments performed on a sample of the MARIAN database show that the resource requirements of C2ICM are within practical bounds for most platforms. Furthermore, C2ICM offers considerable savings over reclustering. The results of this study will lead to an additional type of browsing and/or searching facility for MARIAN, the large online public access library catalog (OPAC) project based at Virginia Tech.