Linked Knowledge Sources for Topic Classification of Microposts: A Semantic Graph-Based Approach

Short text messages, a.k.a microposts (e.g., tweets), have proven to be an effective channel for revealing information about trends and events, ranging from those related to disaster (e.g., Hurricane Sandy) to those related to violence (e.g., Egyptian revolution). Being informed about such events as they occur could be extremely important to authorities and emergency professionals by allowing such parties to immediately respond. In this work we study the problem of topic classification (TC) of microposts, which aims to automatically classify short messages based on the subject(s) discussed in them. The accurate TC of microposts however is a challenging task since the limited number of tokens in a post often implies a lack of sufficient contextual information. In order to provide contextual information to microposts, we present and evaluate several graph structures surrounding concepts present in linked knowledge sources (KSs). Traditional TC techniques enrich the content of microposts with features extracted only from the microposts content. In contrast our approach relies on the generation of different weighted semantic meta-graphs extracted from linked KSs. We introduce a new semantic graph, called category meta-graph. This novel meta-graph provides a more fine grained categorisation of concepts providing a set of novel semantic features. Our findings show that such category meta-graph features effectively improve the performance of a topic classifier of microposts. Furthermore our goal is also to understand which semantic feature contributes to the performance of a topic classifier. For this reason we propose an approach for automatic estimation of accuracy loss of a topic classifier on new, unseen microposts. We introduce and evaluate novel topic similarity measures, which capture the similarity between the KS documents and microposts at a conceptual level, considering the enriched representation of these documents. Extensive evaluation in the context of Emergency Response (ER) and Violence Detection (VD) revealed that our approach outperforms previous approaches using single KS without linked data and Twitter data only up to 31.4% in terms of F1 measure. Our main findings indicate that the new category graph contains useful information for TC and achieves comparable results to previously used semantic graphs. Furthermore our results also indicate that the accuracy of a topic classifier can be accurately predicted using the enhanced text representation, outperforming previous approaches considering content-based similarity measures.

[1]  Evgeniy Gabrilovich,et al.  Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge , 2006, AAAI.

[2]  Paolo Ferragina,et al.  Classification of Short Texts by Deploying Topical Annotations , 2012, ECIR.

[3]  Scott Sanner,et al.  Improving LDA topic models for microblogs via tweet pooling and automatic labeling , 2013, SIGIR.

[4]  Ian H. Witten,et al.  Learning to link with wikipedia , 2008, CIKM '08.

[5]  Jeffrey V. Nickerson,et al.  Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia , 2011, HCI.

[6]  Haixun Wang,et al.  Short Text Conceptualization Using a Probabilistic Knowledgebase , 2011, IJCAI.

[7]  Efthimis N. Efthimiadis,et al.  Conversational tagging in twitter , 2010, HT '10.

[8]  Raphaël Troncy,et al.  POLITECNICO DI TORINO Repository ISTITUZIONALE NERD : A Framework for Evaluating Named Entity Recognition Tools in the Web of Data / , 2022 .

[9]  Razvan C. Bunescu,et al.  Using Encyclopedic Knowledge for Named entity Disambiguation , 2006, EACL.

[10]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[11]  Edith Schonberg,et al.  Extracting Enterprise Vocabularies Using Linked Open Data , 2009, International Semantic Web Conference.

[12]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[13]  Amit P. Sheth,et al.  Linked Open Social Signals , 2010, 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.

[14]  Denilson Barbosa,et al.  Topic Classification of Blog Posts Using Distant Supervision , 2012 .

[15]  Susan T. Dumais,et al.  Characterizing Microblogs with Topic Models , 2010, ICWSM.

[16]  Simone Paolo Ponzetto,et al.  WikiRelate! Computing Semantic Relatedness Using Wikipedia , 2006, AAAI.

[17]  Julie Beth Lovins,et al.  Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[18]  Vasileios Lampos,et al.  Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods , 2012, ArXiv.

[19]  Óscar Corcho,et al.  Associating Semantics to Multilingual Tags in Folksonomies , 2010, EKAW.

[20]  Aba-Sah Dadzie,et al.  Making Sense of Microposts (#MSM2013) Concept Extraction Challenge , 2013, #MSM.

[21]  Qi Gao,et al.  Analyzing user modeling on twitter for personalized news recommendations , 2011, UMAP'11.

[22]  Gerhard Weikum,et al.  YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract , 2013, IJCAI.

[23]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[24]  Jimmy J. Lin,et al.  Smoothing techniques for adaptive online language models: topic tracking in tweet streams , 2011, KDD.

[25]  Wouter Weerkamp,et al.  How people use Twitter in different languages , 2011 .

[26]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[27]  Brendan T. O'Connor,et al.  From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series , 2010, ICWSM.

[28]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[29]  Timothy Baldwin,et al.  Lexical Normalisation of Short Text Messages: Makn Sens a #twitter , 2011, ACL.

[30]  Vikas Sindhwani,et al.  Emerging topic detection using dictionary learning , 2011, CIKM '11.

[31]  Yutaka Matsuo,et al.  Earthquake shakes Twitter users: real-time event detection by social sensors , 2010, WWW '10.

[32]  Geert-Jan Houben,et al.  What Makes a Tweet Relevant for a Topic? , 2012, #MSM.

[33]  M. de Rijke,et al.  Adding semantics to microblog posts , 2012, WSDM '12.

[34]  Matthew Rowe,et al.  Harnessing linked knowledge sources for topic classification in social media , 2013, HT '13.

[35]  Peter Mika,et al.  Making Sense of Twitter , 2010, SEMWEB.

[36]  Hongfei Yan,et al.  Comparing Twitter and Traditional Media Using Topic Models , 2011, ECIR.

[37]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[38]  Fabio Ciravegna,et al.  Exploring the Similarity between Social Knowledge Sources and Twitter for Cross-domain Topic Classification , 2012, KECSM@ISWC.

[39]  Bernardo A. Huberman,et al.  Predicting the Future with Social Media , 2010, Web Intelligence.

[40]  Jonghun Park,et al.  Automatic extraction of persistent topics from social text streams , 2013, World Wide Web.

[41]  Oren Etzioni,et al.  Named Entity Recognition in Tweets: An Experimental Study , 2011, EMNLP.

[42]  Johan Bollen,et al.  Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena , 2009, ICWSM.

[43]  P. Gloor,et al.  Predicting Stock Market Indicators Through Twitter “I hope it is not as bad as I fear” , 2011 .

[44]  S. Dumais Latent Semantic Analysis. , 2005 .

[45]  Jens Lehmann,et al.  DBpedia - A crystallization point for the Web of Data , 2009, J. Web Semant..

[46]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[47]  Nello Cristianini,et al.  Flu Detector - Tracking Epidemics on Twitter , 2010, ECML/PKDD.

[48]  John Blitzer,et al.  Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification , 2007, ACL.

[49]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[50]  D. Oard,et al.  Wikipedia-based topic clustering for microblogs , 2011, ASIST.

[51]  Matthew Michelson,et al.  Tweet Disambiguate Entities Retrieve Folksonomy SubTree Step 1 : Discover Categories Generate Topic Profile from SubTrees Step 2 : Discover Profile Topic Profile : “ English Football ” “ World Cup ” , 2010 .

[52]  Haixun Wang,et al.  Probase: a probabilistic taxonomy for text understanding , 2012, SIGMOD Conference.

[53]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[54]  Qiang Yang,et al.  Transferring Naive Bayes Classifiers for Text Classification , 2007, AAAI.

[55]  Oscar Corcho,et al.  Identifying Topics in Social Media Posts using DBpedia , 2011 .

[56]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[57]  Mike Thelwall,et al.  Biographies or Blenders: Which Resource Is Best for Cross-Domain Sentiment Analysis? , 2012, CICLing.