Parallel gene selection and dynamic ensemble pruning based on Affinity Propagation

Gene selection and sample classification based on gene expression data are important research areas in bioinformatics. Selecting the genes most relevant to classification is challenging because microarray data are high-dimensional and sample sizes are small. The neighborhood-based extension of rough set theory has been applied successfully to gene selection, since it selects attributes without redundancy and handles numerical attributes directly. However, computing rough set approximations is extremely time-consuming. To accelerate gene selection, this paper proposes a parallel method for computing the approximations of the intersection neighborhood rough set. It also proposes a novel dynamic ensemble pruning approach, based on Affinity Propagation clustering and a dynamic pruning framework, to reduce memory usage and computational cost. Experimental results on three Arabidopsis thaliana biotic and abiotic stress response datasets show that the proposed method achieves better classification performance than an ensemble method with gene pre-selection.
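To make the cost of the approximation step concrete, the following is a minimal sketch of lower and upper approximations in a plain neighborhood rough set. It is not the paper's intersection neighborhood algorithm. The Euclidean distance, the radius `delta`, the function names, and the toy data are all assumptions chosen for illustration. Each sample's neighborhood is computed independently, which is what makes the step a natural candidate for parallel execution. The thread pool here only shows that structure; a real speed-up in Python would likely need processes or a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def neighborhood(i, samples, delta):
    """Indices of all samples within distance delta of sample i (i included)."""
    return {j for j, x in enumerate(samples) if dist(samples[i], x) <= delta}


def approximations(samples, labels, target, delta, workers=4):
    """Lower and upper approximations of the class `target` (hypothetical helper)."""
    target_set = {i for i, y in enumerate(labels) if y == target}
    # Neighborhoods are independent of each other, so they can be mapped in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hoods = list(pool.map(lambda i: neighborhood(i, samples, delta),
                              range(len(samples))))
    # Lower: neighborhood lies entirely inside the class.
    lower = {i for i, h in enumerate(hoods) if h <= target_set}
    # Upper: neighborhood overlaps the class.
    upper = {i for i, h in enumerate(hoods) if h & target_set}
    return lower, upper


# Toy one-gene dataset (made up for illustration).
samples = [(0.0,), (0.1,), (0.5,), (0.9,), (1.0,)]
labels = ["a", "a", "a", "b", "b"]
lower, upper = approximations(samples, labels, "a", delta=0.45)
print(sorted(lower), sorted(upper))
```

Samples in the upper but not the lower approximation form the boundary region. In this sketch, a smaller boundary for a given gene subset would indicate that the subset separates the classes more cleanly.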
