Clustering and classification of large document bases in a parallel environment

Development of cluster-based search systems has been hampered by the prohibitive time required to cluster large document sets. Even once a clustering is complete, the cluster organization is difficult to maintain in dynamic file environments. We propose the use of parallel computing systems to overcome the computational cost of the clustering process. Two operations are examined: clustering a document set and classifying the document set. A subset of the TIPSTER corpus, specifically articles from the Wall Street Journal, is used. Document set classification was performed without the large storage requirement (potentially as high as 522M) for ancillary data matrices. In all cases, the parallel system ran faster than the sequential system and produced the same clustering and classification scheme. Some results show near-linear speedup in higher-threshold clustering applications.
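
The speedup claim above is conventionally read against two standard definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by the number of processors, so "near linear" means efficiency close to 1. The following is a minimal sketch of those definitions only. The function names and the timings in the usage lines are hypothetical illustrations, not measurements or code from this study.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Return the speedup: sequential run time divided by parallel run time."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, processors: int) -> float:
    """Return the parallel efficiency: speedup divided by processor count.

    A value near 1.0 corresponds to near-linear speedup.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(t_sequential, t_parallel) / processors


if __name__ == "__main__":
    # Hypothetical timings (seconds), chosen only to illustrate the arithmetic.
    t_seq, t_par, p = 120.0, 16.0, 8
    print(speedup(t_seq, t_par))        # 7.5
    print(efficiency(t_seq, t_par, p))  # 0.9375
```

With these illustrative numbers, 8 processors give a speedup of 7.5 against an ideal of 8, which is the kind of result usually described as near linear.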