A Generalized Topic Modeling Approach for Maven Search

This paper addresses the problem of semantics-based maven search in the research community, that is, identifying people with a given expertise. Traditional approaches ignore either semantic knowledge or temporal information, so some relevant mavens cannot be identified effectively: the query keywords may not occur in their documents, and time effects are not exploited. We propose a novel semantics and temporal information based maven search (STMS) approach that discovers latent topics (semantically related soft clusters of words) linking authors, venues (conferences or journals) and time simultaneously. In the proposed approach, each author in a venue is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words and over the year of the venue for that topic. Through the discovered latent topics, we can search for mavens by implicitly modeling word-author, author-author and author-venue correlations with continuous time effects. We explain the inference procedure for the topics and authors of new venues. We also show how correlations between authors can be discovered and how topic sparseness degrades retrieval performance. Experimental results on a corpus downloaded from DBLP show that the proposed approach significantly outperforms the baseline approach, owing to its ability to produce less sparse topics.
