COMPUTER CLASSIFICATION OF DOCUMENTS

Abstract: A word selection measure is employed to delete those terms that rarely occur and those that have a low conditional probability of occurring in a category. A set of sample documents known to belong to each category is used to estimate the mean frequency, the within-category variance, and the between-category variance of the remaining terms. These statistics are then employed to compute discriminant functions, which provide weighting coefficients for each term. A new document is classified by counting the frequencies of the selected terms occurring in it and weighting the difference between this vector of observed frequencies and the mean vector of every category. The probability of membership in each category is computed, and the document is assigned to the category having the highest probability. For applications in which assignment to a single category is not desirable, the probabilities can be used to indicate multi-category assignment. A thesaurus capability allows the following types of words to be treated as equivalent: inflected words, compound words, and semantically similar words with different orthographic spellings. Since the technique is based on statistical measures, it can classify documents written in any language, provided a sample set of documents in that language is available.
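
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the report's actual implementation: the abstract gives no formulas, so the sketch makes several simplifying assumptions. The selection thresholds (`min_total`, `min_cond_prob`), the variance floor (`var_floor`), the tiny `THESAURUS` table, and all function names are hypothetical. "Conditional probability of occurring in a category" is read here as the fraction of a category's sample documents that contain the term. The discriminant is a diagonal Gaussian one with a pooled within-category variance and equal prior probabilities; the between-category variance mentioned in the abstract is not used in this simplified version.

```python
import math
from collections import Counter

# Hypothetical thesaurus: maps inflected or variant forms to one canonical term.
THESAURUS = {"cells": "cell", "genes": "gene", "particles": "particle"}

def tokenize(text):
    # Lowercase, split on whitespace, then merge equivalent words.
    return [THESAURUS.get(w, w) for w in text.lower().split()]

def select_terms(samples, min_total=2, min_cond_prob=0.5):
    """Drop rare terms and terms with a low per-category occurrence rate.

    samples: dict mapping category name -> list of document strings.
    Both thresholds are illustrative assumptions.
    """
    total = Counter()
    doc_rate = {}  # category -> {term: fraction of its documents containing term}
    for cat, docs in samples.items():
        present = Counter()
        for doc in docs:
            tokens = tokenize(doc)
            total.update(tokens)
            present.update(set(tokens))
        doc_rate[cat] = {t: n / len(docs) for t, n in present.items()}
    keep = []
    for term, n in total.items():
        best = max(doc_rate[cat].get(term, 0.0) for cat in samples)
        if n >= min_total and best >= min_cond_prob:
            keep.append(term)
    return sorted(keep)

def vectorize(doc, terms):
    counts = Counter(tokenize(doc))
    return [counts[t] for t in terms]

def fit(samples, terms, var_floor=0.05):
    """Estimate per-category mean vectors and pooled within-category variances."""
    means = {}
    sq_dev = [0.0] * len(terms)
    n_docs = 0
    for cat, docs in samples.items():
        vecs = [vectorize(d, terms) for d in docs]
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        means[cat] = mean
        for v in vecs:
            for j, x in enumerate(v):
                sq_dev[j] += (x - mean[j]) ** 2
        n_docs += len(vecs)
    dof = max(n_docs - len(samples), 1)
    # var_floor keeps zero-variance terms from producing a division by zero.
    variances = [s / dof + var_floor for s in sq_dev]
    return means, variances

def classify(doc, terms, means, variances):
    """Return (best category, membership probabilities), assuming equal priors."""
    x = vectorize(doc, terms)
    scores = {}
    for cat, mean in means.items():
        # Each term's squared deviation from the category mean is weighted
        # by the inverse of its pooled within-category variance.
        d2 = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
        scores[cat] = -0.5 * d2
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    probs = {c: e / z for c, e in expd.items()}
    return max(probs, key=probs.get), probs

# Toy sample set, invented purely for illustration.
samples = {
    "physics": ["quantum energy particle energy", "particle quantum laser"],
    "biology": ["cell protein gene gene", "gene cell tissue protein"],
}
terms = select_terms(samples)
means, variances = fit(samples, terms)
label, probs = classify("the genes and the protein in cells", terms, means, variances)
print(terms)
print(label, probs)
```

Because `classify` returns the full probability table rather than only the top label, a caller that wants multi-category assignment could keep every category whose probability exceeds a chosen cutoff. Nothing in the sketch depends on the language of the documents beyond whitespace tokenization, which mirrors the abstract's point that only a sample set in the target language is needed.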