Feature expansion for Microblogging text based on Latent Dirichlet Allocation with User Feature

Traditional Topic Detection and Tracking (TDT) is designed for large-scale news streams. With the development of new technology, however, microblogging platforms have become a new generation of platforms for information distribution and communication. Because microblogging text has many features that differ entirely from those of ordinary news reports, existing TDT methods become ineffective on it. We present a new framework based on U-LDA (Latent Dirichlet Allocation with User Feature), which takes into account user features on the microblogging platform. We use the U-LDA model to expand the features of short microblogging texts, which improves the precision of TDT tasks. In this paper, we discuss and summarize the particular features of microblogging text, present a method that incorporates user features into the LDA model, and on that basis propose a general TDT framework built on U-LDA. Applying the new model to a microblogging corpus, we conclude that U-LDA is more effective than LDA.
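To make the idea of topic-based feature expansion concrete, the following is a minimal sketch and not the U-LDA model itself, whose details are not given here. It assumes that a topic model has already been trained and that the user feature can be approximated as a per-user prior over topics. The topic names, probability values, user names, and the functions `topic_posterior` and `expand_features` are all hypothetical and serve illustration only.

```python
from math import exp, log

# Toy topic-word probabilities p(word | topic). Hypothetical values; in practice
# these would come from a trained topic model.
TOPIC_WORD = {
    "sports": {"match": 0.4, "goal": 0.3, "team": 0.2, "win": 0.1},
    "finance": {"stock": 0.4, "market": 0.3, "bank": 0.2, "win": 0.1},
}

# Assumed per-user topic preferences p(topic | user). This is one simple way to
# model a "user feature"; the paper's actual formulation may differ.
USER_TOPIC = {
    "alice": {"sports": 0.8, "finance": 0.2},
    "bob": {"sports": 0.2, "finance": 0.8},
}

SMOOTH = 1e-3  # fallback probability for words a topic has never seen


def topic_posterior(tokens, user):
    """Approximate p(topic | tokens, user) with a naive-Bayes-style product."""
    log_scores = {}
    for topic, word_probs in TOPIC_WORD.items():
        score = log(USER_TOPIC[user][topic])  # user prior over topics
        for tok in tokens:
            score += log(word_probs.get(tok, SMOOTH))
        log_scores[topic] = score
    # Normalize in a numerically stable way.
    max_score = max(log_scores.values())
    unnorm = {t: exp(s - max_score) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


def expand_features(tokens, user, top_k=2):
    """Append the top_k most probable words of the most likely topic."""
    posterior = topic_posterior(tokens, user)
    best_topic = max(posterior, key=posterior.get)
    word_probs = TOPIC_WORD[best_topic]
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    extra = [w for w in ranked if w not in tokens][:top_k]
    return list(tokens) + extra


if __name__ == "__main__":
    # The same one-word post is expanded differently depending on the author.
    print(expand_features(["win"], "alice"))
    print(expand_features(["win"], "bob"))
```

In this toy setting the word "win" is equally likely under both topics, so the user prior alone decides which topic supplies the added words. That is the intuition behind bringing user information into the expansion of very short texts, though a real model would learn these distributions jointly from a corpus.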
