Detecting research topics via the correlation between graphs and texts

In this paper we address the problem of detecting topics in large-scale linked document collections. Topic detection has recently become a very active area of research because of its utility for information navigation, trend analysis, and high-level description of data. We present an approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph, where the nodes are limited to the documents containing the term. This tight coupling between term and graph analysis distinguishes our method from other approaches, such as those that focus on language models. We develop a topic score measure for each term, using the likelihood ratio of binary hypotheses based on a probabilistic description of graph connectivity. Our approach is based on the intuition that if a term is relevant to a topic, the documents containing the term have denser connectivity than a random selection of documents. We extend our algorithm to detect a topic represented by a set of terms, using the intuition that if the co-occurrence of terms represents a new topic, the citation pattern should exhibit a synergistic effect. We test our algorithm on two electronic research literature collections, arXiv and Citeseer. Our evaluation shows that the approach is effective and reveals some novel aspects of topic detection.
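The intuition behind the per-term topic score can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: each ordered pair of documents is treated as an independent Bernoulli trial for a citation link, the null hypothesis uses the link density of the whole collection, and the alternative uses the density observed among the documents containing the term. The names `topic_score`, `log_likelihood`, `doc_terms`, and `citations`, as well as the toy data, are hypothetical.

```python
import math

def log_likelihood(k, n_pairs, p):
    """Bernoulli log-likelihood of k links among n_pairs ordered pairs.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite at p = 0 or p = 1
    return k * math.log(p) + (n_pairs - k) * math.log(1 - p)

def topic_score(term, doc_terms, citations):
    """Log-likelihood ratio that documents containing `term` are linked
    more densely than the collection as a whole (illustrative model only)."""
    # Keep only citations between distinct documents in the collection.
    edges = {(a, b) for a, b in citations
             if a != b and a in doc_terms and b in doc_terms}
    n_all = len(doc_terms)
    p0 = len(edges) / (n_all * (n_all - 1))  # H0: collection-wide density

    # Restrict the citation graph to documents containing the term.
    members = {d for d, terms in doc_terms.items() if term in terms}
    n_pairs = len(members) * (len(members) - 1)
    if n_pairs == 0:
        return 0.0
    k = sum(1 for a, b in edges if a in members and b in members)
    p1 = k / n_pairs  # H1: density observed in the term's subgraph

    if p1 <= p0:  # one-sided test: only denser-than-random counts
        return 0.0
    return log_likelihood(k, n_pairs, p1) - log_likelihood(k, n_pairs, p0)

# Toy collection (hypothetical data): document id -> set of terms.
doc_terms = {
    "d1": {"quantum", "entanglement"},
    "d2": {"quantum", "entanglement"},
    "d3": {"quantum", "entanglement"},
    "d4": {"quantum", "method"},
    "d5": {"method", "graph"},
    "d6": {"method", "graph"},
}
# Citations as (citing, cited) pairs.
citations = [("d2", "d1"), ("d3", "d1"), ("d3", "d2"),
             ("d5", "d4"), ("d6", "d1")]

for term in ("entanglement", "quantum", "method"):
    print(term, round(topic_score(term, doc_terms, citations), 3))
```

In this toy collection, the documents containing "entanglement" cite one another heavily and receive the highest score, "quantum" is diluted by a less connected document and scores lower, and the generic term "method" is linked no more densely than the collection average and scores zero. Under the same assumptions, a set of terms could be scored by restricting `members` to the documents containing all of the terms.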
