A Multi-Objective Optimization-based Clustering Approach for CORD-19 Scholarly Articles
COVID-19, the pandemic disease caused by the SARS-CoV-2 virus, has spread globally. Researchers are working tirelessly on studying the transmission of COVID-19, improving its identification, designing new vaccines and therapies, and understanding its socio-economic consequences. This extensive research has produced thousands of scientific papers related to biology, chemistry, genetics, health, and the economy. It is therefore essential to develop an intelligent text mining technique for organizing this rich source of data so that it can be accessed, searched, and interpreted with minimal time and resources. In this paper, we propose a multi-objective optimization-based document clustering approach for the CORD-19 (COVID-19 Open Research Dataset) dataset. The approach uses BioBERT and draws on both the abstract and the document text, rather than the brief abstract alone, to capture the content of each article and generate better-defined clusters. The proposed work makes two main contributions. First, we use BioBERT to generate sentence embeddings, which are then used to build the document representations. Second, we develop a multi-objective optimization (MOO) based clustering algorithm for grouping the resulting document vectors, with the Non-dominated Sorting Genetic Algorithm-II (NSGA-II) as the underlying MOO technique and the fuzzy c-means algorithm as the underlying clustering technique. The model is evaluated using the Silhouette score and the Calinski-Harabasz index (CH index), and the clustering solutions are visualized using word clouds. The clustering results show significant improvements over several existing clustering models.
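To make the two building blocks named above more concrete, here is a minimal sketch of the standard fuzzy c-means membership update and of the non-dominated filter at the core of NSGA-II. It is an illustration under stated assumptions, not the authors' implementation: the function names `fcm_memberships` and `pareto_front`, the toy 2-D points, and the choice to maximize every objective are hypothetical, and the abstract does not say which objective functions the MOO step optimizes. Real inputs would be the BioBERT-derived document vectors rather than 2-D points.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership of each point in each cluster (fuzzifier m > 1)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        on_center = [d == 0.0 for d in dists]
        if any(on_center):
            # Point coincides with one or more centers: split membership among them.
            share = 1.0 / sum(on_center)
            memberships.append([share if hit else 0.0 for hit in on_center])
            continue
        # Standard update: u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)).
        row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]
        memberships.append(row)
    return memberships


def pareto_front(scores):
    """Indices of solutions that no other solution dominates.

    Assumption: every objective is maximized. This is only the first
    non-dominated front, not the full NSGA-II ranking and crowding procedure.
    """
    def dominates(a, b):
        no_worse = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return no_worse and strictly_better

    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


if __name__ == "__main__":
    # Hypothetical toy data: three points on a line and two cluster centers.
    u = fcm_memberships([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)])
    print(u)
    # Hypothetical objective pairs for three candidate clusterings.
    print(pareto_front([(0.30, 120.0), (0.25, 150.0), (0.20, 100.0)]))
```

In a full MOO-based clustering loop, each candidate solution would encode a set of cluster centers, memberships would be recomputed as above, and the non-dominated solutions would be carried into the next generation. The final clustering would then be chosen from the resulting front.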