Efficiency of Hierarchic Agglomerative Clustering using the ICL Distributed Array Processor

Hierarchic agglomerative methods of cluster analysis are very demanding of computational resources when applied to large datasets on conventional computers. The ICL Distributed Array Processor (DAP) allows many of the scanning and matching operations required in clustering to be carried out in parallel. Experiments are described using the single linkage and Ward's hierarchic agglomerative clustering methods on both real and simulated datasets. Clustering runs on the DAP are compared with runs of the most efficient serial algorithms currently available, implemented on an IBM 3083 BX. The DAP is found to be 2.9–7.9 times as fast as the IBM, the exact degree of speed‐up depending on the size of the dataset, the clustering method, and the serial clustering algorithm that is used. An analysis of the cycle times of the two machines suggests that further, very substantial speed‐ups could be obtained from array processors of this type if they were based on more powerful processing elements.
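
To make the scanning and updating operations concrete, the following is a minimal serial sketch of stored-matrix single linkage in Python. It is an illustration only, not the DAP or IBM code used in the experiments: the function name `single_linkage` and the choice of squared Euclidean distance are assumptions made for the example. The scan for the closest pair and the per-cluster distance update are the kinds of steps that an array processor could plausibly apply to many elements at once.

```python
from itertools import combinations


def single_linkage(points):
    """Naive stored-matrix single linkage; returns merges as (a, b, distance)."""
    n = len(points)
    # Squared Euclidean distance for every pair of points (assumed measure).
    dist = {
        (i, j): sum((p - q) ** 2 for p, q in zip(points[i], points[j]))
        for i, j in combinations(range(n), 2)
    }
    active = set(range(n))
    merges = []
    while len(active) > 1:
        # Scan step: find the closest pair of active clusters.
        (a, b), d = min(
            ((pair, dist[pair]) for pair in combinations(sorted(active), 2)),
            key=lambda item: item[1],
        )
        merges.append((a, b, d))
        # Update step: single linkage keeps the smaller of the two distances.
        for k in active - {a, b}:
            ka = (min(k, a), max(k, a))
            kb = (min(k, b), max(k, b))
            dist[ka] = min(dist[ka], dist[kb])
        active.remove(b)  # cluster b is absorbed into cluster a
    return merges


if __name__ == "__main__":
    for merge in single_linkage([(0, 0), (0, 1), (5, 5), (5, 7)]):
        print(merge)
```

Each pass of the loop scans all remaining cluster pairs, so this naive form does cubic work overall on a serial machine; more efficient serial algorithms avoid the repeated full scan.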
