A Novel Content Enriching Model for Microblog Using News Corpus

In this paper, we propose a novel model that enriches the content of microblogs with external knowledge, thereby alleviating the data sparseness problem in short text classification. We assume that microblogs share their topics with the external knowledge source. We first build an optimization model that infers the topics of each microblog from the topic-word distribution of the external knowledge. The content of the microblog is then enriched with relevant words drawn from the external knowledge. Experiments on microblog classification show that our approach is effective and outperforms traditional text classification methods.
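
The two steps described above (topic inference against a fixed topic-word distribution, then enrichment with related words) can be illustrated with a minimal sketch. The abstract does not state the paper's actual objective function, so this sketch substitutes a simple assumption: it maximizes the likelihood of the microblog's words under a mixture of the fixed topics, using EM-style updates. The names `infer_topics` and `enrich`, the toy vocabulary, and the values in `phi` are all hypothetical and chosen only for illustration.

```python
def infer_topics(counts, phi, iters=100):
    """Estimate topic proportions for one short text, keeping phi fixed.

    counts: dict mapping word index -> count in the microblog
    phi:    K rows of V word probabilities (topic-word distribution,
            assumed to come from a topic model trained on a news corpus)
    """
    k = len(phi)
    theta = [1.0 / k] * k  # start from a uniform topic mixture
    for _ in range(iters):
        new = [0.0] * k
        for w, c in counts.items():
            mix = sum(theta[j] * phi[j][w] for j in range(k))
            if mix == 0.0:
                continue  # word has no support under any topic
            for j in range(k):
                # share of this word's count attributed to topic j
                new[j] += c * theta[j] * phi[j][w] / mix
        total = sum(new)
        if total == 0.0:
            break
        theta = [v / total for v in new]
    return theta


def enrich(counts, phi, theta, top_n=3):
    """Return indices of the top_n most probable words not already in the text."""
    k, v = len(phi), len(phi[0])
    scores = [sum(theta[j] * phi[j][w] for j in range(k)) for w in range(v)]
    candidates = [w for w in range(v) if w not in counts]
    candidates.sort(key=lambda w: scores[w], reverse=True)
    return candidates[:top_n]


# Toy data (hypothetical): two topics over a six-word vocabulary.
vocab = ["game", "team", "score", "vote", "party", "election"]
phi = [
    [0.40, 0.30, 0.24, 0.02, 0.02, 0.02],  # a sports-like topic
    [0.02, 0.02, 0.02, 0.30, 0.24, 0.40],  # a politics-like topic
]
counts = {0: 1, 1: 1}  # microblog containing "game" and "team"

theta = infer_topics(counts, phi)
extra = enrich(counts, phi, theta, top_n=1)
enriched_words = [vocab[w] for w in counts] + [vocab[w] for w in extra]
```

In this toy case the inferred mixture concentrates on the sports-like topic, so the word appended to the microblog comes from that topic. A classifier would then be trained on the enriched word lists rather than on the original short texts.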
