Indonesian Online News Extraction and Clustering Using Evolving Clustering

In Indonesia, 43,000 online media outlets publish at least one to two stories every hour. This volume of information exceeds human processing capacity and leads to effects such as confusion and psychological pressure. This study proposes the Evolving Clustering method, which continually adapts existing model knowledge to a real, ever-evolving environment without re-clustering the data. It also proposes feature extraction with vector space-based stemming features to improve Indonesian language stemming. The system consists of seven stages: (1) Data Acquisition, (2) Data Pipeline, (3) Keyword Feature Extraction, (4) Data Aggregation, (5) Predefined Cluster using the Automatic Clustering algorithm, (6) Evolving Clustering, and (7) News Clustering Result. The experimental results show that Automatic Clustering generated 388 predefined clusters from 3,000 news articles, one of which is the unknown cluster. Evolving Clustering then ran for two days on streaming news, resulting in a total of 611 clusters. It worked well in both updating existing models and adding new ones. The Evolving Clustering algorithm performed reasonably well, with a cluster accuracy of 88%. However, some clusters are incorrect, so the keyword feature extraction process should be re-evaluated to extract features that are appropriate for grouping. In the future, this method can be developed further by adding other functions, refining how models are updated and added, and evaluating the results.
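
The update-or-add behavior described above can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `EvolvingClusters`, the cosine similarity measure, the running-mean centroid update, and the `threshold` value are all assumptions chosen for illustration. It only shows how a stream of keyword vectors might either update an existing cluster model or create a new one, without re-clustering earlier data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class EvolvingClusters:
    """Hypothetical incremental clusterer seeded with predefined centroids."""

    def __init__(self, centroids, threshold=0.8):
        # Predefined clusters, e.g. from an earlier batch clustering step.
        self.centroids = [list(c) for c in centroids]
        # Assumption: each seed centroid counts as one observation.
        self.counts = [1] * len(self.centroids)
        # Assumed similarity cutoff; not a value from the paper.
        self.threshold = threshold

    def add(self, vec):
        """Assign vec to the most similar cluster, or create a new cluster."""
        best, best_sim = -1, -1.0
        for i, c in enumerate(self.centroids):
            s = cosine(vec, c)
            if s > best_sim:
                best, best_sim = i, s
        if best >= 0 and best_sim >= self.threshold:
            # Update the existing model: running mean of the centroid.
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[best], vec)
            ]
            self.counts[best] = n + 1
            return best
        # Add a new model: the incoming vector becomes its own centroid.
        self.centroids.append(list(vec))
        self.counts.append(1)
        return len(self.centroids) - 1


model = EvolvingClusters([[1.0, 0.0], [0.0, 1.0]], threshold=0.8)
first = model.add([0.9, 0.1])   # close to the first centroid: model is updated
second = model.add([1.0, 1.0])  # below the threshold for every centroid: new cluster
```

In this sketch, a lower `threshold` merges more articles into existing clusters, while a higher one creates new clusters more readily; the appropriate setting would depend on the feature extraction used.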
