An Analysis of Some Graph Theoretical Cluster Techniques

Several graph-theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters. Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar. Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms that find maximal complete subgraphs have been tested: one developed by Bierstone offers a significant time improvement over one suggested by Bonner. For threshold levels T ≥ 0.6, essentially the same clusters result regardless of the cluster definition used. In such situations one need only find the connected components of the graph to obtain the clusters.
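The first two cluster definitions can be illustrated with a minimal sketch. It assumes a symmetric similarity matrix with entries in [0, 1] and treats an entry as an edge when it is at least T. The function names and the five-term toy matrix `sim` are hypothetical and are not data from the study. The clique search shown is plain Bron-Kerbosch enumeration, used here only as a stand-in; it is not the Bierstone or the Bonner algorithm compared in the paper, and the Gotlieb and Kumar definition is not covered.

```python
from itertools import combinations

def threshold_graph(sim, T):
    """Adjacency sets of the binary threshold matrix: i~j iff sim[i][j] >= T (assumed rule)."""
    adj = {i: set() for i in range(len(sim))}
    for i, j in combinations(range(len(sim)), 2):
        if sim[i][j] >= T:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def connected_components(adj):
    """Cluster definition (1): connected components, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def maximal_cliques(adj):
    """Cluster definition (2): maximal complete subgraphs (Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already-explored vertices
        if not p and not x:
            found.append(set(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

# Hypothetical term-term similarity matrix for five terms (illustration only).
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.7, 0.1, 0.1],
    [0.8, 0.7, 1.0, 0.65, 0.1],
    [0.1, 0.1, 0.65, 1.0, 0.3],
    [0.1, 0.1, 0.1, 0.3, 1.0],
]
```

On this toy matrix the two definitions differ at T = 0.6, where the single component {0, 1, 2, 3} splits into the cliques {0, 1, 2} and {2, 3}, and coincide at T = 0.7. This only shows the mechanism by which higher thresholds can make the definitions agree; it says nothing about where that happens on a real corpus.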

[1]  D. J. Rogers,et al.  A Computer Program for Classifying Plants , 1960, Science.

[2]  H. Edmund Stiles,et al.  The Association Factor in Information Retrieval , 1961, JACM.

[3]  Journal of the Association for Computing Machinery , 1961, Nature.

[4]  M. Kochen,et al.  Concerning the possibility of a cooperative information exchange , 1962 .

[5]  Harold Borko,et al.  The construction of an empirically based mathematically derived classification system , 1962, AIEE-IRE '62 (Spring).

[6]  P. E. Jones,et al.  LINEAR ASSOCIATIVE INFORMATION RETRIEVAL , 1962 .

[7]  Frank B. Baker,et al.  Information Retrieval Based upon Latent Class Analysis , 1962, JACM.

[8]  Roger M. Needham,et al.  A Method for Using Computers in Information Classification , 1962, IFIP Congress.

[9]  P. M. Marcus Fundamental research in superconductivity , 1962 .

[10]  O. Ore,et al.  Graphs and Their Uses , 1964 .

[11]  Harold Borko RESEARCH IN DOCUMENT CLASSIFICATION AND FILE ORGANIZATION , 1963 .

[12]  Raymond E. Bonner,et al.  On Some Clustering Techniques , 1964, IBM J. Res. Dev.

[13]  Harold Borko,et al.  Automatic Document Classification Part II. Additional Experiments , 1964, JACM.

[14]  Geoffrey H. Ball,et al.  Data analysis in the social sciences: what about the details? , 1965, AFIPS '65 (Fall, part I).

[15]  Mary Elizabeth Stevens,et al.  Automatic indexing: a state-of-the-art report , 1965 .

[16]  Graph separability and word grouping , 1966, CACM.

[17]  Evan Leon Ivie Search procedures based on measures of relatedness between documents. , 1966 .

[18]  L. B. Doyle BREAKING THE COST BARRIER IN AUTOMATIC CLASSIFICATION , 1966 .

[19]  A. R. Meetham Graph separability and word grouping , 1966, ACM '66.

[20]  Karen Spärck Jones,et al.  Current approaches to classification and clump-finding at the Cambridge Language Research Unit , 1967, Comput. J.

[21]  Maurice V. Wilkes,et al.  The design of multiple-access computer systems , 1967, Computer/law journal.

[22]  C. H. Hunt,et al.  Computers and the small firm: 1 , 1968, Comput. J.

[23]  Calvin C. Gotlieb,et al.  Semantic Clustering of Index Terms , 1968, J. ACM.

[24]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[25]  Karen Spärck Jones Automatic term classification and information retrieval , 1968, IFIP Congress.

[26]  A. J. Willmott,et al.  Cluster analysis on the Atlas computer , 1968, Comput. J.

[27]  Gerard Salton,et al.  Automatic Information Organization And Retrieval , 1968 .

[28]  Samuel Schiminovich,et al.  A clustering experiment: First step towards a computer-generated classification scheme , 1968, Inf. Storage Retr.

[29]  John C. Ogilvie The distribution of number and size of connected components in random graphs of medium size , 1968, IFIP Congress.

[30]  Michael Lesk,et al.  Word-word associations in document retrieval systems , 1969 .

[31]  R. T. Dattola,et al.  A Fast Algorithm for Automatic Classification , 1969 .