Keyword extraction for blogs based on content richness

This paper proposes a method for extracting topic keywords from blogs based on the richness of their content. If a blog contains rich content related to a topic word, that word can be considered a keyword of the blog. To this end, we propose a new measure, richness, which indicates how much of a keyword's trendy subtopics a blog covers. To obtain the trendy subtopics of a keyword, we use the web as an external source of topical context: because the web contains diverse and current information, it exposes the popular and trendy content related to a topic. For each candidate keyword, a set of web documents is retrieved with Google, and the subtopics found in those documents are modelled with a probabilistic approach. Using these subtopic models, the method scores the richness of a blog for each candidate keyword; a word on which the blog offers varied content should be chosen as one of its keywords. In experiments, the proposed method is compared with several other methods and yields better results in terms of hit count, trendiness and consistency.
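
The abstract does not give the richness formula, so the following is only a minimal sketch of the idea under stated assumptions, not the paper's actual measure. It assumes each candidate keyword already has a set of subtopics, each represented as a popularity weight and a word distribution (for example, from a topic model fitted on retrieved web documents), and it scores a blog by the weighted probability mass of subtopic words that the blog contains. The names `richness`, `extract_keywords`, and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: (popularity weight, word -> probability).
Subtopic = Tuple[float, Dict[str, float]]


def richness(blog_words: Set[str], subtopics: List[Subtopic]) -> float:
    """Weighted share of subtopic probability mass covered by the blog.

    Assumption: coverage of one subtopic is the summed probability of its
    words that appear in the blog; the paper may define this differently.
    """
    total_weight = sum(weight for weight, _ in subtopics)
    if total_weight == 0:
        return 0.0
    score = 0.0
    for weight, word_probs in subtopics:
        covered = sum(p for word, p in word_probs.items() if word in blog_words)
        score += weight * covered
    return score / total_weight


def extract_keywords(
    blog_words: Set[str],
    candidate_subtopics: Dict[str, List[Subtopic]],
    top_k: int = 3,
) -> List[str]:
    """Rank candidate keywords by richness (ties broken alphabetically)."""
    scored = [
        (richness(blog_words, subtopics), keyword)
        for keyword, subtopics in candidate_subtopics.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [keyword for _, keyword in scored[:top_k]]


# Toy data, invented purely for illustration.
blog = {"camera", "lens", "battery"}
candidates = {
    "camera": [
        (0.6, {"lens": 0.5, "zoom": 0.5}),
        (0.4, {"battery": 0.5, "charger": 0.5}),
    ],
    "travel": [(1.0, {"hotel": 0.5, "flight": 0.5})],
}
top_keywords = extract_keywords(blog, candidates, top_k=1)
```

In this toy case the blog touches both subtopics of "camera" and none of "travel", so "camera" ranks first. A word mentioned often but covering only one subtopic would score lower than one whose subtopics are broadly covered, which is the intuition the abstract describes.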
