Latent Friend Mining from Blog Data

The rapid growth of blog (also known as "weblog") data provides a rich resource for social community mining. In this paper, we put forward a novel research problem: mining the latent friends of bloggers based on the contents of their blog entries. We define latent friends as people whose blogs share a similar topic distribution. These people may not know each other, but they have the shared interests and the potential to find one another. We design three approaches to latent friend detection. The first, the cosine similarity-based method, measures the similarity between bloggers by computing the cosine similarity between the contents of their blogs. The second, the topic-based method, discovers latent topics using a latent topic model and then computes similarity at the topic level. The third, the two-level similarity-based method, proceeds in two stages. In the first stage, an existing topic hierarchy is exploited to build a topic distribution for each blogger. In the second stage, a detailed similarity comparison is conducted among the bloggers found in the first stage to be close in interest. Our experimental results show that both the topic-based and two-level similarity-based methods work well, and that the last approach performs much better than the first two. We also give a detailed analysis of the advantages and disadvantages of the different approaches.
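
The content-level cosine comparison and the two-stage idea described above can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: the tokenizer is a plain whitespace split, the per-blogger topic distributions are supplied by hand rather than derived from a topic hierarchy, and the names `term_counts`, `cosine`, `latent_friends`, and `topic_threshold` are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercased, whitespace-tokenized term counts for a blogger's entries."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def latent_friends(blogs, topics, target, topic_threshold=0.5):
    """Two-stage ranking of candidate latent friends for `target`.

    Stage 1 keeps bloggers whose topic distribution is close to the target's.
    Stage 2 ranks the survivors by cosine similarity of their blog contents.
    The threshold value is an assumption, not a figure from the paper.
    """
    candidates = [
        name for name in blogs
        if name != target
        and cosine(topics[target], topics[name]) >= topic_threshold
    ]
    target_vec = term_counts(blogs[target])
    scored = [(name, cosine(target_vec, term_counts(blogs[name])))
              for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
blogs = {
    "alice": "python code python data",
    "bob": "python data code review",
    "carol": "garden soil tomato seeds",
}
topics = {
    "alice": {"tech": 0.9, "home": 0.1},
    "bob": {"tech": 0.8, "home": 0.2},
    "carol": {"tech": 0.1, "home": 0.9},
}

result = latent_friends(blogs, topics, "alice")
print(result)
```

In this toy run, the first stage drops "carol" because her topic distribution is far from that of "alice", so only "bob" reaches the content-level comparison. Filtering on coarse topics first keeps the more detailed comparison restricted to bloggers who are already close in interest, which is the motivation the abstract gives for the two-level design.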
