Generating Cohesive Semantic Topics from Latent Factors

Extracting topics from posts in social networks is a challenging and relevant computational task. Traditionally, topics are extracted by analyzing syntactic properties in the messages, assuming a high correlation between syntax and semantics. This work proposes SToC, a new method for generating more cohesive and meaningful semantic topics within a context. SToC post-processes the output of a Non-Negative Matrix Factorization (NMF) method in order to determine which latent factors should be further merged to improve cohesion. Based on NMF's output, SToC defines a topics transition graph and uses Markovian theory to merge pairs of topics mutually reachable in this graph. Experiments on two real data sample from Twitter demonstrate that is statistically better than fair baselines in supervised scenarios and able to determine cohesive and semantically valid topics in unsupervised scenarios.

[1]  Ming-Syan Chen,et al.  Mining Web informative structures and contents based on entropy analysis , 2004, IEEE Transactions on Knowledge and Data Engineering.

[2]  Qiaozhu Mei,et al.  One theme in all views: modeling consensus topics in multiple contexts , 2013, KDD.

[3]  Jaegul Choo,et al.  UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization , 2013, IEEE Transactions on Visualization and Computer Graphics.

[4]  ChenMing-Syan,et al.  Mining Web Informative Structures and Contents Based on Entropy Analysis , 2004 .

[5]  Xueqi Cheng,et al.  Group sparse topical coding: from code to topic , 2013, WSDM.

[6]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[7]  Rafael Berlanga Llavori,et al.  Topic discovery based on text mining techniques , 2007, Inf. Process. Manag..

[8]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[9]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[10]  Quan Wang,et al.  Group matrix factorization for scalable topic modeling , 2012, SIGIR '12.

[11]  Chih-Jen Lin,et al.  On the Convergence of Multiplicative Update Algorithms for Nonnegative Matrix Factorization , 2007, IEEE Transactions on Neural Networks.

[12]  Xiaohui Yan,et al.  Learning Topics in Short Texts by Non-negative Matrix Factorization on Term Correlation Matrix , 2013, SDM.

[13]  Michael W. Berry,et al.  Algorithms and applications for approximate nonnegative matrix factorization , 2007, Comput. Stat. Data Anal..

[14]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[15]  Marcel Abendroth,et al.  Data Mining Practical Machine Learning Tools And Techniques With Java Implementations , 2016 .