Paper Information - Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering

Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering

Topic extraction is an essential task in bibliometric data analysis, data mining and knowledge discovery; it seeks to identify significant topics in text collections. Conventional topic extraction schemes require human intervention and also involve comprehensive pre-processing to represent text collections appropriately. In this paper, we present a two-stage framework for topic extraction from scientific literature, in which word embedding schemes are used in conjunction with cluster analysis. In the first stage, we propose an improved word embedding scheme that incorporates word vectors obtained by the word2vec, POS2vec, word-position2vec and LDA2vec schemes. In the clustering stage, we present an improved clustering ensemble framework that combines conventional clustering methods (k-means, k-modes, k-means++, self-organizing maps and the DIANA algorithm) by means of iterative voting consensus. In the empirical analysis, we examine a corpus of 160,424 abstracts of articles from various disciplines, including agricultural engineering, economics, engineering and computer science. The performance of the proposed scheme is compared with conventional baseline clustering methods (such as k-means, k-modes and k-means++), LDA-based topic modelling and conventional word embedding schemes. The results show that the ensemble word embedding scheme yields better predictive performance than the baseline word vectors for topic extraction, and that the ensemble clustering framework outperforms the baseline clustering methods. The proposed framework improves the Jaccard coefficient, the Fowlkes-Mallows measure and the F1 score.
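The abstract names iterative voting consensus as the step that merges the base clusterings but does not spell it out. The following is a minimal sketch of the basic idea only, not the paper's improved ensemble: each document is described by the labels it received from every base clustering, each consensus cluster is summarized by a per-clustering majority vote, and documents are reassigned to the nearest summary under Hamming distance until nothing changes. The function names, the toy labelings and the seeding rule (the first k distinct label tuples) are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two label tuples disagree."""
    return sum(x != y for x, y in zip(a, b))


def iterative_voting_consensus(base_labelings, k, max_iter=100):
    """Merge several base clusterings into one consensus labeling.

    base_labelings: list of label lists, one per base clustering,
    all covering the same items in the same order.
    """
    # Describe each item by the labels it got from every base clustering.
    items = list(zip(*base_labelings))
    # Assumption: seed the centres with the first k distinct label tuples.
    # If fewer than k distinct tuples exist, fewer clusters are produced.
    centres = list(dict.fromkeys(items))[:k]
    consensus = None
    for _ in range(max_iter):
        # Assign every item to the centre with the smallest Hamming distance.
        updated = [
            min(range(len(centres)), key=lambda c: hamming(item, centres[c]))
            for item in items
        ]
        if updated == consensus:
            break  # assignments are stable
        consensus = updated
        # Recompute each centre as the majority label per base clustering.
        for c in range(len(centres)):
            members = [it for it, lab in zip(items, consensus) if lab == c]
            if members:
                centres[c] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return consensus


# Toy example: three base clusterings of six documents. The second uses
# swapped label names, and the third disagrees on the third document.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
print(iterative_voting_consensus(base, k=2))  # [0, 0, 0, 1, 1, 1]
```

Because the vote is taken separately for each base clustering, the label names used by the individual clusterings do not have to agree with one another, which is why the swapped labels in the second clustering cause no trouble.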

Aytuğ Onan
