Formal concept analysis for topic detection: A clustering quality experimental analysis

We propose a novel application of FCA-based methods for Topic Detection, overcoming traditional problems of clustering and classification techniques. We achieve state-of-the-art results for the topic detection task at RepLab 2013. We propose an evaluation framework to measure the quality of topic detection algorithms, including an external and an internal (quality-based) evaluation methodology. We conduct an extensive analysis of the performance of Hierarchical Agglomerative Clustering and Latent Dirichlet Allocation on the topic detection task in comparison to FCA. We show that the proposed FCA-based approach is better, in terms of clustering quality, than the other two.

The Topic Detection task focuses on discovering the main topics addressed by a series of documents (e.g., news reports, e-mails, tweets). Topics, defined in this way, are expected to be thematically similar, cohesive and self-contained. This task has been broadly studied from the point of view of clustering and probabilistic techniques. In this work, we propose applying Formal Concept Analysis (FCA), an exploratory technique for data analysis and organization, to this task. In particular, we extend the FCA-based methods for topic detection found in the literature by using the stability concept for topic selection. The hypothesis is that FCA will enable a better organization of the data, and stability a better selection of topics from that organization, thus better fulfilling the task requirements by improving the quality and accuracy of the topic detection process. In addition, the proposed FCA-based methodology copes with some well-known drawbacks of clustering and probabilistic methodologies, such as the need to set a predefined number of clusters or the difficulty of dealing with topics that have complex generalization-specialization relationships.
To test this hypothesis, the FCA-based approach is compared with two established techniques: Hierarchical Agglomerative Clustering (HAC) and Latent Dirichlet Allocation (LDA). To allow this comparison, the authors have implemented these approaches in a novel experimental framework. The quality of the topics detected by each approach, in terms of their suitability for the topic detection task, is evaluated by means of internal clustering validity metrics. This evaluation shows that FCA generates cohesive clusters that are less sensitive to changes in cluster granularity. Driven by the quality of the detected topics, FCA achieves the best overall outcome, improving on the experimental results for the Topic Detection task at the RepLab 2013 campaign.
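As an illustration of the two ideas the approach combines, the following minimal Python sketch enumerates the formal concepts of a toy document-term context and scores each one with the commonly used intensional stability index (the fraction of subsets of a concept's extent that still yield the same intent). The toy context, the function names and the brute-force enumeration are assumptions made for illustration only; the exact stability variant and the selection procedure used in the paper are not specified here.

```python
from itertools import combinations

# Hypothetical toy context: documents (objects) mapped to their terms (attributes).
CONTEXT = {
    "d1": {"bank", "loan"},
    "d2": {"bank", "loan", "rate"},
    "d3": {"bank", "rate"},
    "d4": {"football", "goal"},
}
ALL_TERMS = set().union(*CONTEXT.values())


def common_terms(docs):
    """Derivation operator on objects: terms shared by every document in docs."""
    if not docs:
        return set(ALL_TERMS)  # the empty set of documents shares all terms
    return set.intersection(*(CONTEXT[d] for d in docs))


def docs_with(terms):
    """Derivation operator on attributes: documents containing all given terms."""
    return {d for d, doc_terms in CONTEXT.items() if terms <= doc_terms}


def subsets(items):
    """Yield every subset of items (exponential; only suitable for toy data)."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def formal_concepts():
    """Brute-force enumeration: close every document subset into (extent, intent)."""
    concepts = set()
    for docs in subsets(CONTEXT):
        intent = common_terms(docs)
        extent = docs_with(intent)
        concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def stability(extent, intent):
    """Intensional stability: share of extent subsets whose common terms equal intent."""
    hits = sum(1 for s in subsets(extent) if common_terms(s) == set(intent))
    return hits / 2 ** len(extent)


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(), key=lambda c: sorted(c[0])):
        print(sorted(extent), sorted(intent), stability(extent, intent))
```

In a sketch like this, each concept is a candidate topic (a set of documents together with the terms they all share), and a stability threshold would decide which candidates to keep. No number of clusters has to be fixed in advance, and the subconcept ordering between extents captures generalization-specialization relationships between topics.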