Fast Clustering of Short Text Streams Using Efficient Cluster Indexing and Dynamic Similarity Thresholds

Short text stream clustering is an important but challenging task, since massive amounts of text are generated by sources such as micro-blogging, question-answering, and social news aggregation websites. A major challenge is clustering this volume of text within a reasonable amount of time. Existing state-of-the-art short text stream clustering methods cannot do so, because they compute the similarity between each incoming text and all existing clusters before assigning the text to a cluster. To overcome this challenge, we propose a fast short text stream clustering method, called FastStream, that indexes the clusters using an inverted index and computes the similarity between a text and only a selected subset of clusters when assigning the text to a cluster. In this way, we reduce the running time not only of our proposed method but also of several state-of-the-art short text stream clustering methods. FastStream assigns a text to a new or existing cluster using similarity thresholds that are computed dynamically based on statistical measures, which lets the method deal efficiently with the concept drift problem. Experimental results demonstrate that FastStream outperforms the state-of-the-art short text stream clustering methods by a significant margin on several short text datasets. In addition, FastStream runs several orders of magnitude faster than the state-of-the-art methods.
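The two ideas named in the abstract, candidate selection through an inverted index and a similarity threshold that adapts to the stream, can be illustrated with a minimal sketch. It is not the FastStream implementation: the class name `IndexedStreamClusterer`, the term-overlap similarity, the `initial_threshold` parameter, and the "mean minus one standard deviation of past best similarities" rule are all assumptions chosen for illustration, since the abstract does not specify the similarity function or the statistical measure.

```python
from collections import defaultdict
from statistics import mean, pstdev


class IndexedStreamClusterer:
    """Toy stream clusterer: inverted index over clusters plus an adaptive threshold.

    Hypothetical illustration only; names and formulas are assumptions.
    """

    def __init__(self, initial_threshold=0.5):
        self.cluster_terms = {}          # cluster id -> set of terms seen in it
        self.index = defaultdict(set)    # term -> ids of clusters containing it
        self.history = []                # best similarity observed for past texts
        self.initial_threshold = initial_threshold  # assumed cold-start value

    def candidates(self, terms):
        """Return only the clusters that share at least one term with the text."""
        found = set()
        for term in terms:
            found |= self.index.get(term, set())
        return found

    def threshold(self):
        """Assumed rule: mean minus one std. dev. of past best similarities."""
        if len(self.history) < 2:
            return self.initial_threshold
        return mean(self.history) - pstdev(self.history)

    def add(self, text):
        """Assign one text to an existing or new cluster and return its id."""
        terms = set(text.lower().split())
        if not terms:
            raise ValueError("text has no terms")

        # Score the text against candidate clusters only, not all clusters.
        # Stand-in similarity: fraction of the text's terms found in the cluster.
        scores = {
            cid: len(terms & self.cluster_terms[cid]) / len(terms)
            for cid in self.candidates(terms)
        }

        target = None
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= self.threshold():
                target = best
            self.history.append(scores[best])

        if target is None:
            # No sufficiently similar candidate: open a new cluster.
            target = len(self.cluster_terms)
            self.cluster_terms[target] = set()

        # Update the cluster and keep the inverted index in sync.
        self.cluster_terms[target] |= terms
        for term in terms:
            self.index[term].add(target)
        return target


stream = [
    "python list sort",
    "python list reverse",
    "bake chocolate cake",
    "chocolate cake recipe",
    "python cake",
]
clusterer = IndexedStreamClusterer()
print([clusterer.add(text) for text in stream])  # prints [0, 0, 1, 1, 2]
```

In this sketch the cost of assigning a text depends on the number of clusters that share a term with it rather than on the total number of clusters, and the threshold moves with the similarities observed so far instead of being fixed in advance. The last text overlaps two clusters only weakly, so it falls below the adapted threshold and opens a new cluster.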
