Utilizing Wikipedia Topic Hierarchy in Estimating Topics of Blog Sites and Topic Distribution of Blogosphere

This paper studies how to estimate the distribution of topics in the Japanese blogosphere, using about 300,000 Wikipedia entries to represent a hierarchy of topics. First, to estimate whether at least one blog feed is closely related to a given topic, we use the number of hits of the topic keyword in the blogosphere. We empirically examine the range of the number of hits and conclude that it should fall between 10,000 and 500,000. According to our manual evaluation of this range, about 70% of Wikipedia entries can be linked to at least one blog feed, which partially justifies our claim. Then, we apply SVMs to the task of judging, given a topic, whether each blog feed is closely related to that topic. Based on the learned SVM model, we further automatically judge whether at least one blog feed is closely related to a given topic.
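
The hit-count criterion described above can be sketched as a simple range check. This is a minimal illustration, not the authors' implementation: the function names, the inclusive bounds, and the example hit counts are all assumptions made here, and only the 10,000 to 500,000 range comes from the abstract.

```python
# Hit-count range reported in the abstract.
MIN_HITS = 10_000
MAX_HITS = 500_000


def has_candidate_feed(hit_count, lo=MIN_HITS, hi=MAX_HITS):
    """Return True if the hit count falls inside the range.

    Treating both bounds as inclusive is an assumption of this sketch.
    """
    return lo <= hit_count <= hi


def filter_topics(hit_counts):
    """Keep the topics whose keyword hit count lies in the range.

    `hit_counts` maps a topic keyword (e.g. a Wikipedia entry title)
    to its number of hits in the blogosphere.
    """
    return [topic for topic, hits in hit_counts.items()
            if has_candidate_feed(hits)]


# Hypothetical hit counts, for illustration only (not data from the paper).
example_hits = {"topic_a": 2_500, "topic_b": 48_000, "topic_c": 1_200_000}
selected = filter_topics(example_hits)
print(selected)
```

Topics that pass this filter are only candidates for having a closely related blog feed; in the paper, the per-feed judgment is made by the SVM step.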
