An investigation of document structures

Abstract: The presence of clustering structure in a document collection, and its influence on the success of cluster-based retrieval, are investigated as a function of term-weight and similarity thresholds. The term-weight threshold selects a particular level of indexing exhaustivity for the document representation, and the similarity threshold selects a specific level of the associated single-link hierarchy. The results show clear evidence of clustering structure in the most exhaustive and the least exhaustive subject representations. They also show that the observed values of cluster-based retrieval effectiveness at all exhaustivity levels can be explained by assuming that the pairwise associations responsible for the structure imposed on the document collection are generated randomly. This suggests that the structure imposed on a small document collection by an automatically produced subject representation is unrelated to the structure imposed on the documents by relevance relationships.
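
To illustrate how a similarity threshold selects one level of a single-link hierarchy, the sketch below relies on a standard property: the single-link clusters at a given threshold are the connected components of the graph whose edges join document pairs with similarity at or above that threshold. The function name, the input layout, and the toy similarity values are assumptions made for illustration only; they are not taken from the study.

```python
def single_link_clusters(similarity, n_docs, threshold):
    """Return the single-link clusters at one similarity threshold.

    similarity: dict mapping document pairs (i, j) to a similarity value
                (hypothetical input layout, chosen for this sketch).
    n_docs:     number of documents, indexed 0 .. n_docs - 1.
    threshold:  pairs with similarity >= threshold are treated as linked.
    """
    # Union-find structure: each document starts in its own cluster.
    parent = list(range(n_docs))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair that meets the threshold.
    for (i, j), value in similarity.items():
        if value >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Group documents by the root of their component.
    clusters = {}
    for doc in range(n_docs):
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(clusters.values())


# Toy similarities between four documents (invented values).
toy_similarity = {(0, 1): 0.9, (1, 2): 0.6, (2, 3): 0.2, (0, 3): 0.1}

for level in (0.8, 0.5, 0.1):
    print(level, single_link_clusters(toy_similarity, 4, level))
```

Lowering the threshold admits more pairwise links, so clusters can only merge, never split; sweeping the threshold from high to low therefore traces out the hierarchy level by level.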
