Improving the Clustering of Blogosphere with a Self-term Enriching Technique

The analysis of blogs is emerging as an exciting new area in the text processing field, one which attempts to harness and exploit the vast quantity of information being published by individuals. However, the particular characteristics of blogs (shortness, vocabulary size and nature, etc.) make it difficult to achieve good results using automated clustering techniques. Moreover, because many blogs may be considered narrow-domain, exploiting external linguistic resources can have limited value. In this paper, we present a methodology that improves the performance of clustering techniques on blogs without relying on external resources. Our results show that this technique can produce significant improvements in the quality of the clusters produced.
