Paper Information - Selecting the Most Highly Correlated Pairs within a Large Vocabulary

Selecting the Most Highly Correlated Pairs within a Large Vocabulary

Occurrence patterns of words in documents can be expressed as binary vectors. When two vectors are similar, the two words corresponding to the vectors may have some implicit relationship with each other. We call these two words a correlated pair. This report describes a method for obtaining the most highly correlated pairs of a given size. In practice, the method requires O(N log N) computation time and O(N) memory space, where N is the number of documents or records. Since this does not depend on the size of the vocabulary under analysis, it is possible to compute correlations between all the words in a corpus.
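
To make the notion of a correlated pair concrete, the following is a minimal sketch of the underlying idea only, not the method the report describes. It stores each word's binary occurrence vector as the set of document indices in which the word appears, and ranks word pairs by brute force. The Jaccard coefficient is an assumed similarity measure (the abstract does not name one), and the function names `occurrence_sets`, `jaccard`, and `top_pairs` are hypothetical. The brute-force ranking is quadratic in vocabulary size, which is exactly the cost the report's method is said to avoid.

```python
from itertools import combinations


def occurrence_sets(docs):
    """Map each word to the set of document indices in which it occurs.

    This is a sparse form of the binary occurrence vector: index i is in
    the set exactly when the vector has a 1 in position i.
    """
    occ = {}
    for i, doc in enumerate(docs):
        for word in set(doc.split()):
            occ.setdefault(word, set()).add(i)
    return occ


def jaccard(a, b):
    """Assumed similarity measure: shared documents over all documents."""
    return len(a & b) / len(a | b)


def top_pairs(docs, k):
    """Return the k most similar word pairs by brute force.

    Illustration only: this compares every pair of words, so its cost
    grows quadratically with the vocabulary size.
    """
    occ = occurrence_sets(docs)
    scored = [
        (jaccard(occ[u], occ[v]), u, v)
        for u, v in combinations(sorted(occ), 2)
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:k]


docs = ["new york city", "new york times", "city hall", "york new"]
best = top_pairs(docs, 1)
```

In this toy corpus, "new" and "york" occur in exactly the same documents, so they form the top-ranked pair.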

Kyoji Umemura

[1] Tomasz Imielinski, et al. Mining association rules between sets of items in large databases, 1993, SIGMOD Conference.

[2] Kenneth Ward Church, et al. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, 2001, Computational Linguistics.

[3] Gregory Piatetsky-Shapiro, et al. The KDD process for extracting useful knowledge from volumes of data, 1996, CACM.

[4] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.