Toward Parallel Document Clustering

A key challenge in the automated clustering of documents in large text corpora is the high cost of comparing documents in a multi-million-dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a distance metric that obeys the triangle inequality. The algorithm strives to minimize the number of distance calculations needed to cluster the documents into "anchors" around reference documents called "pivots". We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure that affords efficient searching, and a simple data structure that requires repeated sorting. The sorting implementation is integrated with a text-corpus "Bag of Words" program, and initial performance results for the end-to-end document-processing workflow are reported.
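The triangle-inequality pruning described above can be illustrated with a minimal sequential sketch. It is not the parallel implementation reported here. It assumes Euclidean distance on small dense vectors, and the names `euclidean` and `build_anchors` are hypothetical. Each anchor keeps its points sorted by decreasing distance to its pivot. When a new pivot is at distance `gap` from an existing pivot, any point closer than `gap / 2` to the existing pivot cannot be closer to the new one, so the scan of that anchor stops early.

```python
import math

def euclidean(a, b):
    # Assumed metric for this sketch; any metric obeying the triangle inequality works.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_anchors(points, k, dist=euclidean):
    """Build up to k anchors; return (anchors, number of distance evaluations)."""
    calls = 0

    # Start with one anchor whose pivot is point 0 and which owns every point.
    owned = []
    for i in range(len(points)):
        owned.append((dist(points[0], points[i]), i))
        calls += 1
    owned.sort(reverse=True)  # farthest point first
    anchors = [{"pivot": 0, "owned": owned}]

    while len(anchors) < k:
        # New pivot: the point farthest from the pivot that currently owns it.
        widest = max(anchors, key=lambda a: a["owned"][0][0])
        radius, new_pivot = widest["owned"][0]
        if radius == 0.0:
            break  # every point coincides with its pivot

        new_owned = []
        for anchor in anchors:
            gap = dist(points[anchor["pivot"]], points[new_pivot])
            calls += 1
            kept = []
            for pos, (d_old, i) in enumerate(anchor["owned"]):
                if d_old < gap / 2:
                    # Triangle inequality: this point and all remaining
                    # (closer) points stay with the old pivot, unchecked.
                    kept.extend(anchor["owned"][pos:])
                    break
                d_new = dist(points[new_pivot], points[i])
                calls += 1
                if d_new < d_old:
                    new_owned.append((d_new, i))  # point moves to the new anchor
                else:
                    kept.append((d_old, i))
            anchor["owned"] = kept

        new_owned.sort(reverse=True)  # re-sort the newly formed anchor
        anchors.append({"pivot": new_pivot, "owned": new_owned})

    return anchors, calls

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
anchors, calls = build_anchors(points, k=2)
for a in anchors:
    print(a["pivot"], sorted(i for _, i in a["owned"]))
print("distance evaluations:", calls)
```

The re-sort of each newly formed anchor loosely mirrors the simpler of the two data-structure choices mentioned above. A search-oriented structure would avoid that sort at the cost of more complex bookkeeping.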
