An unsupervised multilingual approach for online social media topic identification

An unsupervised multilingual approach to identify topics on Twitter is proposed.

Localised language can be leveraged for identifying relevant and important topics.

Joint term ranking coupled with Dirichlet process mixture model (DPMM) clustering consistently performed well.

Multilingual sentiment analysis is essential to understand sentiment on the ground.

The topic coverage of social media and mainstream media is not always the same.

Social media data can be valuable in many ways. However, the vast amount of content shared and the linguistic variants of the languages used on social media make it very challenging to identify high-value topics. In this paper, we present an unsupervised multilingual approach for identifying highly relevant terms and topics from the mass of social media data. This approach combines term ranking, localised language analysis, unsupervised topic clustering and multilingual sentiment analysis to extract prominent topics from tweets collected on Twitter over a period of time. We observe that each of the ranking methods tested has its own strengths and weaknesses, and that our proposed Joint ranking method is able to take advantage of the strengths of the individual ranking methods. This Joint ranking method, coupled with an unsupervised topic clustering model, is shown to have the potential to discover topics of interest or concern to a local community. Practically, being able to do so may help decision makers gauge the true opinions or concerns on the ground. Theoretically, the research is significant as it shows how an unsupervised online topic identification approach can be designed without much manual annotation effort, which may have great implications for the future development of expert and intelligent systems.
