Automatic classification and retrieval of documents by means of a bibliographic pattern discovery algorithm

Abstract. We present fully automatic procedures for the self-generation of meaningful groups among the members of a document collection and for classifying subsequent documents according to these groups. These procedures operate on large document collections with reasonably short computation times. Thus far, in our experiments on the physics literature, automatic classification has proven to be as good as or better than manual indexing and, in addition, potentially less expensive. Our method was derived empirically. It is based upon a pattern discovery algorithm that uses only the citation content of a document and operates on the bibliographic links among papers. The self-generated groups correspond to very specific subject headings. The retrospective bibliographies generated by the procedures allow one to classify the subsequent literature with remarkably high recall and relevance ratios, close to 100%.
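The abstract does not spell out the pattern discovery algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: documents are linked when they share at least `threshold` cited works, groups are the connected components of those links, and a new document is assigned to the group whose pooled bibliography it overlaps most. The function names, the threshold values, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def coupling_strength(refs_a, refs_b):
    """Number of cited works that two documents have in common."""
    return len(set(refs_a) & set(refs_b))

def form_groups(citations, threshold=2):
    """Group documents whose shared-citation count reaches `threshold`.

    Assumption: groups are connected components of the coupling graph;
    the paper's actual grouping rule may differ.
    """
    docs = list(citations)
    parent = {d: d for d in docs}

    def find(d):
        # Union-find root lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if coupling_strength(citations[a], citations[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in docs:
        groups[find(d)].add(d)
    # Keep only groups with more than one member.
    return [g for g in groups.values() if len(g) > 1]

def group_bibliography(group, citations):
    """Pooled (retrospective) bibliography of all members of a group."""
    bib = set()
    for d in group:
        bib |= set(citations[d])
    return bib

def classify(new_refs, groups, citations, min_overlap=2):
    """Index of the best-matching group, or None if no overlap reaches min_overlap."""
    best, best_overlap = None, 0
    for idx, g in enumerate(groups):
        overlap = len(set(new_refs) & group_bibliography(g, citations))
        if overlap > best_overlap:
            best, best_overlap = idx, overlap
    return best if best_overlap >= min_overlap else None

# Hypothetical sample collection: document id -> list of cited works.
sample = {
    "P1": ["r1", "r2", "r3"],
    "P2": ["r2", "r3", "r4"],
    "P3": ["r7", "r8", "r9"],
    "P4": ["r8", "r9", "r10"],
    "P5": ["r20"],
}
sample_groups = form_groups(sample)
assigned = classify(["r1", "r4", "r99"], sample_groups, sample)
```

In this toy collection, P1 and P2 share two references and so do P3 and P4, which yields two groups, while P5 stays ungrouped. A later document citing `r1` and `r4` overlaps only the first group's pooled bibliography and is assigned to it. The pairwise comparison is quadratic in the number of documents, so a large collection would need an inverted index from cited work to citing documents.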