Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution

Document clustering is the partitioning of a given collection of documents into K groups based on a similarity/dissimilarity criterion. The task has applications in scope detection of journals and conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and semantics-based classification of documents. In this paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed that fuses a self-organizing map (SOM) with a multi-objective differential evolution approach. A variable number of cluster centers is encoded in the different solutions of the population so that the number of clusters is determined from the data set automatically. These solutions undergo various genetic operations during evolution. The concept of SOM is used to design new genetic operators for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, named the self-organizing map based multi-objective document clustering technique (SMODoc_clust), is shown on the automatic classification of scientific articles and web documents. Different representation schemes, including tf, tf-idf, and word embeddings, are employed to convert articles into vector form. Comparative results with respect to two internal cluster validity indices, the Dunn index and the Davies-Bouldin index, are reported against several state-of-the-art clustering techniques: three multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results show that the proposed approach outperforms these existing approaches on the reported indices. The results are also validated using statistical significance t-tests.
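
To make two of the ingredients named above concrete, the following is a minimal sketch of a tf-idf document representation and of the Silhouette index used to score a candidate partition. It is not the SMODoc_clust implementation: the function names, the toy documents, the plain `log(N/df)` idf weighting, and the choice of cosine distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: tf is the relative term frequency and idf is log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors; 1.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_index(vectors, labels, dist=cosine_distance):
    """Mean silhouette width over all points; lies in [-1, 1], higher is better."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette width taken as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0.0 else 0.0
    return total / len(labels)


# Toy corpus (hypothetical): two documents on clustering, two on biology.
docs = [
    "document clustering groups similar documents".split(),
    "clustering similar documents by topic".split(),
    "gene expression protein interaction".split(),
    "protein interaction gene network".split(),
]
vectors = tf_idf(docs)
good = silhouette_index(vectors, [0, 0, 1, 1])  # topic-consistent partition
bad = silhouette_index(vectors, [0, 1, 0, 1])   # topics mixed across clusters
print(good > bad)
```

In a multi-objective setting, a score such as this would be one of the objective values attached to each candidate partition, with a second index evaluated alongside it; the sketch only shows that a topic-consistent partition scores higher than a mixed one.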
