CDIP: Collection-Driven, yet Individuality-Preserving Automated Blog Tagging

With the success of blogs as a popular information-sharing medium, blog search has become popular as well. In the blogosphere, tagging is used to annotate blog entries with contextually meaningful keywords, which enable users to locate blog content more easily. Yet, although tags provided by bloggers are effective for organizing blog entries, in many cases they are not sufficient to properly capture the semantics of the blog content. In our previous work, we observed that there exists a large degree of content overlap among blog entries (not only in the form of quotation/commentary pairs, but also as content borrowing across media outlets), which makes effective, discriminating keyword search difficult. In this paper, we further note that these implicit or explicit quotations can be leveraged to identify the contexts in which entries occur, resulting in more effective tagging. We therefore propose CDIP (a collection-driven, yet individuality-preserving tagging system), which relies on relationships provided by quotation/reuse detection and semantic-focus analysis to automatically tag blogs in such a way that not only do related blogs share tags, but the individuality of the entries is also preserved for discriminating tag-based access.
