How Different Are Language Models and Word Clouds?

Word clouds are a summarised representation of a document’s text, similar to tag clouds, which summarise the tags assigned to documents. Word clouds resemble language models in that both represent a document by its word distribution. In this paper we investigate the differences between word cloud and language modelling approaches, and specifically whether effective language modelling techniques also improve word clouds. We evaluate the quality of the language model using a system evaluation test bed, and evaluate the quality of the resulting word cloud with a user study. Our experiments show that different language modelling techniques can be applied to improve a standard word cloud that uses a TF weighting scheme in combination with stopword removal. Including bigrams in the word clouds and using a parsimonious term weighting scheme are the most effective techniques in both the system evaluation and the user study.
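The techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tiny stopword list, the bigram rule (two adjacent non-stopword tokens), and the parameter values are all assumptions made for illustration. `tf_cloud` shows the baseline TF weighting with stopword removal, optionally adding bigrams. `parsimonious_weights` shows one common formulation of parsimonious term weighting, in which an EM loop shifts probability mass away from terms that a background collection model already explains well.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_cloud(text, top_k=5, use_bigrams=False):
    """Return the top_k (term, weight) pairs using normalised term frequency."""
    tokens = tokenize(text)
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    if use_bigrams:
        # Assumption: a bigram is two adjacent tokens, neither of them a stopword.
        counts.update(
            f"{a} {b}"
            for a, b in zip(tokens, tokens[1:])
            if a not in STOPWORDS and b not in STOPWORDS
        )
    total = sum(counts.values())
    return [(term, n / total) for term, n in counts.most_common(top_k)]


def parsimonious_weights(doc_counts, background, lam=0.1, iterations=20):
    """Estimate P(t|D) with EM, discounting terms the background model explains.

    doc_counts: term -> frequency in the document.
    background: term -> P(t|C), assumed positive for every document term.
    lam: weight of the document model in the mixture (hypothetical default).
    """
    total = sum(doc_counts.values())
    p_doc = {t: n / total for t, n in doc_counts.items()}
    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model.
        expected = {
            t: n * (lam * p_doc[t]) / ((1 - lam) * background[t] + lam * p_doc[t])
            for t, n in doc_counts.items()
        }
        # M-step: renormalise the expected counts into a distribution.
        norm = sum(expected.values())
        p_doc = {t: e / norm for t, e in expected.items()}
    return p_doc
```

In this sketch, a term that is frequent in the document but also frequent in the background collection loses weight under `parsimonious_weights`, while a term specific to the document gains weight. That is the behaviour a word cloud needs once plain stopword removal is no longer enough.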
