A web content mining approach for tag cloud generation

Tag clouds, also known as word clouds, are useful for quickly perceiving the most prominent terms in a text collection and their relative prominence. How well a tag cloud conceptualizes a text corpus depends directly on the quality of the keyphrases extracted from that corpus. Authors of scientific publications usually provide a list of about five to ten keywords that map each publication to its domain; however, given the exponential growth of non-scientific documents on the World Wide Web, an automatic mechanism is needed to identify the keyphrases embedded within them for tag cloud generation. In this paper, we propose a web content mining technique to extract keyphrases from web documents for tag cloud generation. Instead of using partial or full parsing, the proposed method applies an n-gram technique followed by various heuristics-based refinements to identify a set of lexical and semantic features from text documents. We propose a rich set of domain-independent features to model candidate keyphrases and establish their keyphraseness using classification models. We also propose a font-determination function that sets the relative font size of keyphrases in the generated tag cloud. Experiments establish the efficacy of the proposed method, which outperforms the popular keyphrase extraction system KEA.
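To make the pipeline described above more concrete, the following is a minimal sketch of n-gram candidate generation followed by a font-size mapping. It is an illustration under stated assumptions, not the method of this paper: the stopword list, the stopword-boundary heuristic, the use of raw frequency as the score (in place of a classifier's keyphraseness estimate), and the linear `font_size` function with its point-size bounds are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "and", "is", "are", "on", "with"}


def candidate_ngrams(text, max_n=3):
    """Count word n-grams (1..max_n) that neither start nor end with a stopword."""
    counts = Counter()
    # Split on punctuation so that no n-gram spans a phrase boundary.
    for segment in re.split(r"[.,;:!?()\n]+", text.lower()):
        tokens = re.findall(r"[a-z][a-z\-]*", segment)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Heuristic refinement (assumed): drop stopword-bounded candidates.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def font_size(score, min_score, max_score, min_pt=10, max_pt=36):
    """Linearly map a score in [min_score, max_score] to a font size in points."""
    if max_score == min_score:
        # All candidates are equally prominent: use the midpoint size.
        return (min_pt + max_pt) // 2
    ratio = (score - min_score) / (max_score - min_score)
    return round(min_pt + ratio * (max_pt - min_pt))


def tag_cloud(text, top_k=5):
    """Return (phrase, font size) pairs for the top_k most frequent candidates."""
    top = candidate_ngrams(text).most_common(top_k)
    if not top:
        return []
    scores = [count for _, count in top]
    lo, hi = min(scores), max(scores)
    return [(phrase, font_size(count, lo, hi)) for phrase, count in top]
```

Calling `tag_cloud(page_text)` on the plain text of a web page returns phrase and size pairs that a renderer could lay out. In a setup closer to the one described above, the frequency score would be replaced by the output of a classification model trained on lexical and semantic features.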
