Efficient Clustering of Very Large Document Collections

A substantial portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize the collection into clusters of related documents. Under a vector space model, text can be treated as high-dimensional but sparse numerical vectors. Efficiently preprocessing and clustering very large document collections remains a challenge. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that the entire process takes time linear in the size of the document collection. Detailed experimental results are presented. As a highlight, we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
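To make the two ingredients named above concrete, the following is a minimal single-threaded sketch of a sparse vector space model and a clustering loop whose assignment step touches only the nonzero entries of each document vector. The abstract does not name the clustering algorithm, so spherical k-means (cosine similarity against unit-length centroids) is used here as an assumed stand-in. The function names `build_sparse_vectors` and `spherical_kmeans`, the plain term-frequency weighting, and the evenly spaced seeding are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def build_sparse_vectors(docs):
    """Map each document to a unit-length sparse vector {term_id: weight}.

    Weights are raw term frequencies (an assumption; the paper may weight
    terms differently). Only nonzero entries are stored.
    """
    vocab = {}
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())
        vec = {vocab.setdefault(term, len(vocab)): float(c)
               for term, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors, vocab


def spherical_kmeans(vectors, k, n_terms, max_iters=20):
    """Cluster unit-length sparse vectors by cosine similarity.

    Centroids are dense lists; documents stay sparse, so each similarity
    costs time proportional to the document's number of nonzeros.
    """
    n = len(vectors)
    # Hypothetical seeding: k documents at evenly spaced positions.
    centroids = []
    for i in range(k):
        dense = [0.0] * n_terms
        for t, w in vectors[i * n // k].items():
            dense[t] = w
        centroids.append(dense)

    labels = [-1] * n
    for _ in range(max_iters):
        # Assignment step: dot products iterate over nonzeros only.
        new_labels = [
            max(range(k),
                key=lambda j: sum(w * centroids[j][t] for t, w in vec.items()))
            for vec in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels

        # Update step: sum member vectors, then renormalize to unit length.
        sums = [[0.0] * n_terms for _ in range(k)]
        for vec, j in zip(vectors, labels):
            for t, w in vec.items():
                sums[j][t] += w
        for j in range(k):
            norm = math.sqrt(sum(x * x for x in sums[j]))
            if norm:
                centroids[j] = [x / norm for x in sums[j]]
    return labels


docs = [
    "sparse matrix solver",
    "sparse matrix factorization",
    "protein folding energy",
    "protein folding simulation",
]
vectors, vocab = build_sparse_vectors(docs)
labels = spherical_kmeans(vectors, k=2, n_terms=len(vocab))
```

In this sketch the assignment step costs time proportional to the total number of nonzeros times the number of clusters per iteration, which is the sense in which sparsity keeps the work tied to the size of the collection. The centroid update here is written densely for brevity.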
