Solving the Difficult Problem of Topic Extraction in Thai Tweets

In this study, we tackled the difficult problem of topic extraction from Thai tweets about the country's historic flood in 2011. After using Latent Dirichlet Allocation (LDA) to extract the topics, the first difficulty we faced was the inaccuracy of the word segmentation step, which affected our interpretation of the LDA result. To address this, we used the LDA result to refine the stop word list, removing uninformative words produced by the word segmentation, which resulted in a more relevant and comprehensible outcome. With the improved results, we then constructed a rule-based categorization model and used it to categorize all the collected tweets on a per-week scale to observe changes in tweeting trends. Not only did the categories reveal the most relevant and compelling topics that people raised at that time, they also allowed us to understand how people perceived the situation as it unfolded over time.
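The refinement and categorization steps described above can be illustrated with a minimal sketch. It is not the study's implementation: the rule set `RULES`, the category names, and the helper functions `refine_stop_words`, `categorize`, and `weekly_counts` are all hypothetical, and the English tokens stand in for the output of a Thai word segmenter. The LDA step itself is omitted; `topic_words` is assumed to hold the top words of the extracted topics.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical keyword rules; the rules used in the study are not given here.
RULES = {
    "water_level": {"water", "level", "rising"},
    "relief": {"donate", "shelter", "volunteer"},
}

def refine_stop_words(stop_words, topic_words, uninformative):
    """Extend the stop word list with topic words judged uninformative."""
    return set(stop_words) | (set(topic_words) & set(uninformative))

def categorize(tokens, rules=RULES):
    """Return the first category whose keywords overlap the tokens, else 'other'."""
    token_set = set(tokens)
    for category, keywords in rules.items():
        if token_set & keywords:
            return category
    return "other"

def weekly_counts(tweets, stop_words, rules=RULES):
    """Count categories per ISO (year, week) for (date, tokens) pairs."""
    counts = defaultdict(Counter)
    for day, tokens in tweets:
        # Drop stop words before applying the rules.
        kept = [t for t in tokens if t not in stop_words]
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][categorize(kept, rules)] += 1
    return counts
```

Calling `weekly_counts` on a list of `(date, tokens)` pairs yields one `Counter` per ISO week, which can then be compared across weeks to follow how the share of each category changes. Taking the first matching rule is a simplification made for this sketch; a tweet that matches several categories would need a tie-breaking policy.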
