Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering

Topic extraction is an essential task in bibliometric data analysis, data mining and knowledge discovery; it seeks to identify significant topics in text collections. Conventional topic extraction schemes require human intervention and also involve extensive pre-processing to represent text collections appropriately. In this paper, we present a two-stage framework for topic extraction from scientific literature, in which word embedding schemes are used in conjunction with cluster analysis. In the first stage, we propose an improved word embedding scheme that incorporates word vectors obtained by the word2vec, POS2vec, word-position2vec and LDA2vec schemes. In the second (clustering) stage, we present an improved clustering ensemble framework that combines conventional clustering methods (i.e., k-means, k-modes, k-means++, self-organizing maps and the DIANA algorithm) by means of iterative voting consensus. In the empirical analysis, we use a corpus of 160,424 abstracts of articles from various disciplines, including agricultural engineering, economics, engineering and computer science. The performance of the proposed scheme is compared with conventional baseline clustering methods (such as k-means, k-modes and k-means++), LDA-based topic modelling and conventional word embedding schemes. The results indicate that the ensemble word embedding scheme yields better predictive performance than the baseline word vectors for topic extraction, and that the ensemble clustering framework outperforms the baseline clustering methods. The proposed framework improves the Jaccard coefficient, the Fowlkes-Mallows measure and the F1 score.
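To make the consensus step more concrete, the following is a minimal sketch of iterative voting consensus over a set of base clusterings. It is not the authors' implementation: the function names, the optional `init` argument, the tie-breaking rule (first label encountered) and the toy labelings are all assumptions made for illustration. Each item is represented by the labels it received from the base clusterings; cluster centres are formed by majority vote per base clustering, and items are reassigned to the centre at the smallest Hamming distance until the assignment stops changing.

```python
import random
from collections import Counter


def hamming(u, v):
    """Number of positions at which two label tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def iterative_voting_consensus(base_labelings, k, init=None, max_iter=100, seed=0):
    """Combine several base clusterings into one consensus labeling.

    base_labelings: list of H label lists, each of length n.
    k: number of consensus clusters.
    init: optional starting labeling (hypothetical convenience argument);
          a random labeling is used when it is omitted.
    """
    n = len(base_labelings[0])
    # Each item is described by the H labels assigned to it by the base clusterings.
    features = list(zip(*base_labelings))
    rng = random.Random(seed)
    consensus = list(init) if init is not None else [rng.randrange(k) for _ in range(n)]

    for _ in range(max_iter):
        # Voting step: the centre of each non-empty cluster is the majority
        # label in every base clustering (ties go to the first label seen).
        centres = {}
        for c in range(k):
            members = [features[i] for i in range(n) if consensus[i] == c]
            if members:
                centres[c] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
        # Reassignment step: move each item to the nearest centre (Hamming distance).
        updated = [
            min(centres, key=lambda c: hamming(features[i], centres[c]))
            for i in range(n)
        ]
        if updated == consensus:
            break
        consensus = updated
    return consensus


# Toy example: three base clusterings of six items. The second uses swapped
# label names, and the third places item 2 in the "wrong" group.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
labels = iterative_voting_consensus(base, k=2, init=[0, 1, 0, 1, 0, 1])
print(labels)
```

In this toy case the consensus recovers the split that two of the three base clusterings agree on. With a random start the procedure can collapse into a single cluster on small inputs, so in practice several restarts (or an informed initial labeling) would likely be needed.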
