Hot Topic Detection on Twitter Data Streams with Incremental Clustering Using Named Entities and Central Centroids

Hot topic detection is an important knowledge discovery task on social data streams: it determines the topics that are being discussed the most on a social platform. One of the main challenges of this task is processing a very large social dataset in sequential order, both efficiently and effectively. In this paper, we propose a novel incremental clustering-based solution that uses named entities and central centroids for hot topic detection on Twitter data streams. Named entities give each tweet and its cluster a semantic representation, which is then used for efficient search during noise removal and incremental clustering. As a result, clusters of higher quality can be formed, and these provide the basis for deriving true hot topics. In addition, central centroids are defined in place of normal centroids to speed up the incremental clustering process. An empirical evaluation on the Twitter Events2012 dataset shows that our solution is more efficient and effective than the approaches in existing works. Specifically, our entity- and central centroid-based incremental clustering method outperforms the others, with Recall of about 0.92, Normalized Mutual Information of about 0.85, and an execution time of about 30 minutes for processing 500,000 tweets with more than 500 topics. These results confirm the appropriateness of the design features of our new method for hot topic detection on Twitter data streams.
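To make the general idea concrete, the following is a minimal Python sketch of entity-indexed incremental clustering. It is an illustration under stated assumptions, not the method evaluated in this paper. The abstract does not define the central centroid, the similarity measure, or the noise rule, so the sketch substitutes an ordinary summed term-count centroid, cosine similarity with a hypothetical threshold of 0.5, and a rule that discards tweets with no named entities. The names `EntityIndexedClusterer`, `add`, and `hot_topics` are hypothetical, and named entities are assumed to have been extracted beforehand by some NER tool.

```python
from collections import Counter, defaultdict
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EntityIndexedClusterer:
    """Hypothetical single-pass clusterer; not the paper's algorithm."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold             # assumed similarity cut-off
        self.centroids = []                    # summed term counts per cluster
        self.sizes = []                        # number of tweets per cluster
        self.entity_index = defaultdict(set)   # entity -> ids of clusters containing it

    def add(self, terms, entities):
        """Assign one tweet to a cluster; return the cluster id, or None for noise."""
        if not entities:
            return None  # assumption: a tweet without entities is treated as noise
        vec = Counter(terms)
        # Only clusters sharing at least one entity are compared,
        # which avoids scanning every existing cluster.
        candidates = set()
        for entity in entities:
            candidates |= self.entity_index.get(entity, set())
        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            sim = cosine(vec, self.centroids[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is None or best_sim < self.threshold:
            # No sufficiently similar cluster: open a new one.
            best_id = len(self.centroids)
            self.centroids.append(Counter())
            self.sizes.append(0)
        self.centroids[best_id].update(vec)    # incremental centroid update
        self.sizes[best_id] += 1
        for entity in entities:
            self.entity_index[entity].add(best_id)
        return best_id

    def hot_topics(self, k: int = 3):
        """Return the ids of the k largest clusters."""
        return sorted(range(len(self.sizes)), key=lambda c: -self.sizes[c])[:k]


clusterer = EntityIndexedClusterer()
first = clusterer.add(["obama", "wins", "election"], ["obama"])
second = clusterer.add(["obama", "election", "victory"], ["obama"])
third = clusterer.add(["hurricane", "sandy", "landfall"], ["sandy"])
noise = clusterer.add(["lol", "so", "bored"], [])
```

In this toy stream the first two tweets share the entity `obama` and are similar enough to merge, the third opens a new cluster, and the fourth is dropped because it has no entities. Ranking clusters by size is only one possible way to pick hot topics; the abstract does not state which criterion the paper uses.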
