Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling

Text mining is an important research direction that spans several fields, such as information retrieval, information extraction, and text categorization. In this paper, we propose an efficient multiple classifier approach to text categorization based on swarm-optimized topic modelling. Latent Dirichlet allocation (LDA) can overcome the high dimensionality problem of the vector space model, but identifying appropriate parameter values is critical to its performance. The swarm-optimized approach estimates the parameters of LDA, including the number of topics and all the other parameters involved. The hybrid ensemble pruning approach, based on combined diversity measures and clustering, aims to obtain a multiple classifier system with high predictive performance and better diversity. In this scheme, four diversity measures among the classifiers of the ensemble (namely, the disagreement measure, the Q-statistic, the correlation coefficient, and the double fault measure) are combined. Based on the combined diversity matrix, a swarm intelligence based clustering algorithm partitions the classifiers into a number of disjoint groups, and one classifier (the one with the highest predictive performance) is selected from each cluster to build the final multiple classifier system. Experiments were conducted on five biomedical text benchmarks. In the swarm-optimized LDA, different metaheuristic algorithms (such as genetic algorithms, particle swarm optimization, the firefly algorithm, the cuckoo search algorithm, and the bat algorithm) are considered. In the ensemble pruning, five metaheuristic clustering algorithms are evaluated. The experimental results indicate that swarm-optimized LDA yields better predictive performance than conventional LDA. In addition, the proposed multiple classifier system outperforms conventional classification algorithms, ensemble learning, and ensemble pruning methods.
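The four pairwise diversity measures and the per-cluster selection step described above can be illustrated with a minimal Python sketch. It uses the standard definitions of the measures in terms of the counts of instances that both classifiers get right (N11), both get wrong (N00), or only one gets right (N10, N01). The function names, the 0.0 fallback for a zero denominator, and the first-wins tie-breaking are assumptions made for illustration. The sketch does not show how the paper combines the four measures or the swarm-based clustering itself, so the cluster labels are taken as given.

```python
from math import sqrt


def pairwise_diversity(correct_i, correct_j):
    """Return the four pairwise diversity measures for two classifiers.

    correct_i, correct_j: equal-length sequences of booleans, True where
    the classifier labelled that validation instance correctly.
    """
    n = len(correct_i)
    pairs = list(zip(correct_i, correct_j))
    n11 = sum(1 for a, b in pairs if a and b)          # both correct
    n00 = sum(1 for a, b in pairs if not a and not b)  # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)      # only i correct
    n01 = sum(1 for a, b in pairs if not a and b)      # only j correct

    numerator = n11 * n00 - n01 * n10
    q_den = n11 * n00 + n01 * n10
    rho_den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

    return {
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
        # Assumption: report 0.0 when a denominator is zero.
        "q_statistic": numerator / q_den if q_den else 0.0,
        "correlation": numerator / rho_den if rho_den else 0.0,
    }


def select_per_cluster(cluster_labels, accuracies):
    """Keep the index of the most accurate classifier in each cluster.

    cluster_labels[k] is the cluster of classifier k (assumed to come from
    some clustering of the diversity matrix); accuracies[k] is its
    predictive performance. Ties keep the first classifier seen.
    """
    best = {}
    for idx, (label, acc) in enumerate(zip(cluster_labels, accuracies)):
        if label not in best or acc > accuracies[best[label]]:
            best[label] = idx
    return sorted(best.values())


if __name__ == "__main__":
    # Hypothetical correctness vectors for two classifiers on 8 instances.
    a = [True, True, True, False, False, True, False, True]
    b = [True, False, True, True, False, True, False, False]
    print(pairwise_diversity(a, b))

    # Hypothetical cluster assignment and accuracies for 5 classifiers.
    print(select_per_cluster([0, 0, 1, 1, 2], [0.80, 0.85, 0.70, 0.60, 0.90]))
```

Lower Q-statistic, correlation, and double fault values, and higher disagreement values, indicate a more diverse pair, so any combined matrix has to account for the opposite orientation of the disagreement measure.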
