The volume of microblogging messages is increasing exponentially with the popularity of microblogging services. The large number of messages appearing in user interfaces hinders user access to useful information buried in disorganized, incomplete, and unstructured text. To improve accessibility, we propose to aggregate related microblogging messages into clusters and automatically assign semantically meaningful labels to those clusters. However, microblogging messages are distinctive in that they are much shorter than conventional text documents, so they provide too little term co-occurrence information to capture semantic associations. To address this problem, we propose a novel framework for organizing unstructured microblogging messages by transforming them into a semantically structured representation. The proposed framework first captures informative tree fragments by analyzing the parse tree of each message, and then exploits external knowledge bases (Wikipedia and WordNet) to enrich the semantic information of those fragments. Empirical evaluation on a Twitter dataset shows that our framework significantly outperforms existing state-of-the-art methods.
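As a rough illustration of the two stages described above (fragment extraction from a parse tree, then enrichment from an external knowledge base), the following minimal Python sketch may help. It is not the authors' implementation. It assumes hand-written parse trees in place of a real parser, treats noun phrases as the "informative" fragments, uses a hypothetical toy dictionary `TOY_KB` as a stand-in for Wikipedia and WordNet lookups, and measures similarity with plain Jaccard overlap; all of these names and choices are assumptions made for the example.

```python
from typing import Dict, List, Set, Union

# A parse tree node is a tuple (label, child, ...); a leaf is a token string.
# These trees are written by hand here; a real system would obtain them from a parser.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Return the tokens under a tree node, left to right."""
    if isinstance(tree, str):
        return [tree]
    tokens: List[str] = []
    for child in tree[1:]:
        tokens.extend(leaves(child))
    return tokens


def noun_phrases(tree: Tree) -> List[str]:
    """Collect the lowercased text of every NP subtree (assumed 'informative' fragments)."""
    if isinstance(tree, str):
        return []
    found: List[str] = []
    if tree[0] == "NP":
        found.append(" ".join(leaves(tree)).lower())
    for child in tree[1:]:
        found.extend(noun_phrases(child))
    return found


# Hypothetical toy knowledge base standing in for Wikipedia/WordNet concept lookups.
TOY_KB: Dict[str, Set[str]] = {
    "iphone": {"apple", "smartphone"},
    "ipad": {"apple", "tablet"},
}


def enrich(phrases: List[str], kb: Dict[str, Set[str]]) -> Set[str]:
    """Add related concepts from the knowledge base to the extracted fragments."""
    features: Set[str] = set()
    for phrase in phrases:
        features.add(phrase)
        for token in phrase.split():
            features |= kb.get(token, set())
    return features


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two feature sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two short messages with no words in common.
MSG_A: Tree = ("S", ("VP", ("VB", "love"),
                     ("NP", ("PRP$", "my"), ("JJ", "new"), ("NN", "iPhone"))))
MSG_B: Tree = ("S", ("NP", ("DT", "the"), ("NN", "iPad")),
               ("VP", ("VBZ", "rocks")))

plain_sim = jaccard(set(noun_phrases(MSG_A)), set(noun_phrases(MSG_B)))
rich_sim = jaccard(enrich(noun_phrases(MSG_A), TOY_KB),
                   enrich(noun_phrases(MSG_B), TOY_KB))
```

In this toy setting the two messages share no fragments, so their overlap is zero until the knowledge-base concepts are added, which is the kind of sparsity that short messages suffer from. A clustering algorithm could then operate on the enriched feature sets instead of raw terms.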