A dynamic graph model for representing streaming text documents

This work presents new techniques for representing an evolving stream of text documents. Text processing is traditionally performed on a fixed corpus, with each document represented as a vector in a high-dimensional space in which each dimension corresponds to a different word in the lexicon. The lexicon is the set of unique words in the corpus. Each vector entry is the count of the corresponding word in the document, often weighted by the inverse of the probability that the word occurs in a document. This probability of word occurrence, also called the document frequency, is needed to create document vectors that emphasize the informative words in each document. To apply statistical text processing techniques to a changing corpus of documents, a generalization of the vector space model is introduced. The generalization relies on managing a changing lexicon of words and approximating the probability of word occurrence over the documents in the stream. The methods presented here can be used to represent any new document as a vector, including documents that contain words not previously seen in the document stream. Additionally, this work presents a graph model for representing a dynamic corpus of text documents. The graph model differs from other text clustering methods, which act on a fixed corpus of documents. The vertices in the graph represent topics and evolve as the document stream changes. Each vertex contains statistics on documents of a similar topic, including an associated lexicon and document frequencies, which can be used to provide information about the document stream. The graph model is demonstrated on a dataset of news articles collected over several years.
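
As a rough illustration of the generalized vector space model described above, the following Python sketch keeps a growing lexicon with running document frequencies and weights word counts by the log of the inverse estimated document frequency. It is not the method of this work: the class name `StreamingVectorizer`, the add-one smoothing for unseen words, and the logarithmic weighting are all assumptions made for the example.

```python
import math
from collections import Counter


class StreamingVectorizer:
    """Hypothetical sketch of inverse-document-frequency weighting over a stream."""

    def __init__(self):
        self.num_docs = 0            # documents seen so far in the stream
        self.doc_counts = Counter()  # word -> number of documents containing it

    def update(self, tokens):
        """Fold one tokenized document into the running statistics."""
        self.num_docs += 1
        # Count each word once per document; new words extend the lexicon.
        self.doc_counts.update(set(tokens))

    def vectorize(self, tokens):
        """Return a sparse vector (dict) of weighted counts for a document."""
        vector = {}
        for word, count in Counter(tokens).items():
            # Assumption: add-one smoothing keeps the estimated document
            # frequency nonzero for words never seen in the stream.
            df = (self.doc_counts[word] + 1) / (self.num_docs + 1)
            vector[word] = count * math.log(1.0 / df)
        return vector


stream = StreamingVectorizer()
stream.update(["graph", "model", "text"])
stream.update(["graph", "stream"])

# "novel" has not appeared in the stream, yet it still receives a finite weight.
vec = stream.vectorize(["graph", "graph", "novel"])
```

Under these assumptions, a word that occurs in every document so far gets zero weight, while a previously unseen word gets the largest weight available at that point in the stream. Vectorizing a document does not alter the lexicon; only `update` does.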