Ontology-Assisted Discovery of Hierarchical Topic Clusters on the Social Web

Discovering and clustering users by topic of interest on the Social Web can enhance applications such as user recommendation and expert finding. Traditional approaches, such as topic modeling based on latent semantic analysis or k-means document clustering, run into issues when content is sparse, when the number of topics is unknown, or when the topics sought are hierarchical in nature. In this paper, we propose a method for ontology-assisted topic clustering, in which we map Social Web user content to ontological classes to overcome sparsity. Using a novel ranking technique for calculating the topical similarity between individuals at different topic scopes, we construct a graph for each scope and apply a quasi-clique algorithm to it to find the topic clusters at that scope, without having to predefine a target number of topics. Our approach allows (1) the topic scope to be controlled in order to discover general or specific topics; (2) clusters to be labeled automatically with tags that are both human- and machine-understandable; and (3) graphs to be clustered recursively in order to generate a hierarchy of topics. The approach is evaluated against ground truths of Twitter users and against the 20-newsgroups dataset, which is commonly used in document clustering research. We compare our approach to standard and Twitter-specific latent Dirichlet allocation (LDA), hierarchical LDA, and standard and hierarchical k-means clustering. Results show that our method outperforms regular LDA by up to 24.7%, Twitter-LDA by up to 11.9%, and k-means by up to 26.7% on Social Web content. On a dataset of traditional documents, it performs equivalently to these approaches, depending on several factors. Additionally, our method can discover the appropriate number and composition of topics at a given topic scope, whereas k-means clustering cannot account for differences in scope.
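
The pipeline summarized above (map users to ontological classes, score pairwise topical similarity, build a graph, extract dense clusters, label them) can be illustrated with a minimal sketch. It is not the paper's implementation: Jaccard overlap stands in for the novel ranking technique, a simple greedy density heuristic stands in for the quasi-clique algorithm, and the user names, class names, `threshold`, and `gamma` values are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two ontology-class sets (assumed stand-in for the paper's ranking)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(profiles, threshold):
    """Link two users when their class-set similarity reaches the threshold."""
    graph = {user: set() for user in profiles}
    for u, v in combinations(sorted(profiles), 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def density(graph, nodes):
    """Fraction of possible edges that are present among the given nodes."""
    n = len(nodes)
    if n < 2:
        return 1.0
    edges = sum(1 for u, v in combinations(sorted(nodes), 2) if v in graph[u])
    return edges / (n * (n - 1) / 2)


def greedy_quasi_cliques(graph, gamma=0.8, min_size=2):
    """Greedy heuristic (an assumption): grow node sets with edge density >= gamma."""
    unassigned = set(graph)
    clusters = []
    while unassigned:
        # Seed with the best-connected remaining node; ties broken alphabetically.
        seed = max(sorted(unassigned), key=lambda n: len(graph[n] & unassigned))
        cluster = {seed}
        grown = True
        while grown:
            grown = False
            candidates = sorted(
                (n for n in unassigned - cluster if graph[n] & cluster),
                key=lambda n: (-len(graph[n] & cluster), n),
            )
            for n in candidates:
                if density(graph, cluster | {n}) >= gamma:
                    cluster.add(n)
                    grown = True
                    break
        unassigned -= cluster
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


def label(profiles, cluster):
    """Label a cluster with the ontology classes shared by every member."""
    return set.intersection(*(profiles[u] for u in cluster))


# Hypothetical users, already mapped to hypothetical ontology classes.
profiles = {
    "alice": {"Football", "Sport", "WorldCup"},
    "bob": {"Football", "Sport"},
    "carol": {"Football", "Sport", "Tennis"},
    "dave": {"Jazz", "Music"},
    "erin": {"Jazz", "Music", "Piano"},
}

graph = build_graph(profiles, threshold=0.5)
clusters = greedy_quasi_cliques(graph, gamma=0.8)
labels = [label(profiles, c) for c in clusters]
```

In this toy setup no cluster count is supplied: the number of clusters falls out of the graph structure. Raising `threshold` is only a rough analogue of narrowing the topic scope, and re-running the same steps on the members of one cluster would be one way to approximate the recursive, hierarchical clustering the abstract describes.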
