Exploratory analysis of textual data streams

Abstract In this paper, we address the exploratory analysis of textual data streams. We propose a bootstrapping process based on a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters, based on keyword commonalities; (ii) aggregate similar clusters into larger document collections sharing a richer, more user-prominent keyword set that we call a topic; (iii) assimilate the newly extracted topics of the current bootstrapping cycle with the existing topics resulting from previous bootstrapping cycles, by linking similar topics of different time periods, if any, to highlight topic trends and evolution. We also define an analysis framework that enables topic-based exploration of the underlying textual data stream from both a thematic perspective and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
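The three steps of one bootstrapping cycle can be illustrated with a minimal sketch. The abstract does not state which similarity measure or clustering algorithm is used, so everything here is an assumption made for illustration: Jaccard overlap stands in for keyword similarity, a greedy single-pass grouping stands in for the clustering step, and the function names, thresholds, and sample documents are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Group = Tuple[List[str], Set[str]]  # (member ids, union of member keywords)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword overlap of two sets; an assumed stand-in for keyword similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_keywords(items: Dict[str, Set[str]], threshold: float) -> List[Group]:
    """Greedy single-pass grouping: an item joins the first group whose
    keyword set is similar enough, otherwise it starts a new group."""
    groups: List[Group] = []
    for name, keywords in items.items():
        for members, group_keywords in groups:
            if jaccard(keywords, group_keywords) >= threshold:
                members.append(name)
                group_keywords |= keywords  # enrich the group's keyword set in place
                break
        else:
            groups.append(([name], set(keywords)))
    return groups


def link_topics(new: Dict[str, Set[str]], old: Dict[str, Set[str]],
                threshold: float) -> List[Tuple[str, str]]:
    """Step (iii): link each new topic to similar topics of earlier cycles."""
    return [(old_name, new_name)
            for new_name, new_kw in new.items()
            for old_name, old_kw in old.items()
            if jaccard(new_kw, old_kw) >= threshold]


# Hypothetical documents of the current cycle, already reduced to keyword sets.
docs = {
    "d1": {"election", "senate", "vote"},
    "d2": {"election", "senate", "vote", "campaign"},
    "d3": {"tax", "budget", "congress"},
    "d4": {"tax", "budget", "deficit"},
    "d5": {"election", "campaign", "primary"},
}

# Step (i): fine-grained clusters, using a strict threshold.
clusters = group_by_keywords(docs, threshold=0.5)

# Step (ii): aggregate similar clusters into topics, using a looser threshold.
cluster_keywords = {f"c{i}": kw for i, (_, kw) in enumerate(clusters)}
topics = group_by_keywords(cluster_keywords, threshold=0.3)

# Step (iii): link the new topics to a topic kept from a previous cycle.
new_topics = {f"t{i}": kw for i, (_, kw) in enumerate(topics)}
old_topics = {"earlier-elections": {"election", "vote", "campaign", "ballot"}}
links = link_topics(new_topics, old_topics, threshold=0.4)

print([members for members, _ in clusters])
print([members for members, _ in topics])
print(links)
```

Using a stricter threshold in step (i) than in step (ii) mirrors the idea of fine-grained clusters that are later merged into broader topics; the specific values are arbitrary and would need tuning on real data.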
