Gene selection using rough set based on neighborhood for the analysis of plant stress response

A novel neighborhood is defined to deal with numerical data more flexibly. We compare the performance of two kinds of approximation operators on expression data. We propose a significant gene selection algorithm based on positive region and gene ranking (SGS_PRGR). We extend SGS_PRGR to the SGS_PRGR_TO algorithm by introducing NSGA-II for threshold optimization. The proposed algorithms are applied to the analysis of plant stress.

Gene selection and sample classification based on gene expression data are important research trends in bioinformatics. Selecting significant genes that are closely related to classification is very difficult because gene expression data have high dimension and small sample size. Rough set based on neighborhood has been applied successfully to gene selection, as it selects attributes without redundancy and deals with numerical attributes directly. The construction of neighborhoods, the approximation operators and the attribute reduction algorithm are the three key components of this gene selection approach. In this study, a novel neighborhood for numerical data, named intersection neighborhood, was defined. The performance of two kinds of approximation operators was compared on gene expression data. A significant gene selection algorithm using positive region and gene ranking was proposed and applied to the analysis of plant stress response, and this algorithm was then extended with threshold optimization for the intersection neighborhood. The performance of the proposed algorithms was analyzed and compared with that of other related methods, including classical algorithms and rough set methods. Experiments on four data sets showed that the intersection neighborhood adapted more flexibly to data with varied structure, and that the approximation operator based on elementary sets was better suited to this application than the one based on elements. These results indicate that the proposed algorithms are effective: they select significant gene subsets without redundancy and achieve high classification accuracy.
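
The role of the positive region in neighborhood-based gene selection can be illustrated with a minimal sketch. It assumes the standard fixed-threshold neighborhood (a sample's neighbors are all samples within Chebyshev distance delta on the selected genes), not the intersection neighborhood defined in the paper, and a plain greedy forward search rather than SGS_PRGR itself. The names positive_region, greedy_select and delta are hypothetical and chosen only for illustration.

```python
import numpy as np

def positive_region(X, y, genes, delta):
    """Return indices of samples whose neighborhood contains one class only.

    Assumption: a fixed-radius Chebyshev neighborhood on the chosen genes,
    not the paper's intersection neighborhood.
    """
    sub = X[:, genes]
    # Pairwise Chebyshev distances between samples on the selected genes.
    dist = np.abs(sub[:, None, :] - sub[None, :, :]).max(axis=2)
    neighbors = dist <= delta
    same_label = y[:, None] == y[None, :]
    # A sample is in the positive region if every neighbor shares its label.
    pure = np.all(~neighbors | same_label, axis=1)
    return np.flatnonzero(pure)

def greedy_select(X, y, delta):
    """Greedy forward selection: add the gene that most enlarges the
    positive region, and stop when no gene enlarges it further."""
    selected, best = [], 0
    remaining = list(range(X.shape[1]))
    while remaining:
        sizes = [len(positive_region(X, y, selected + [g], delta))
                 for g in remaining]
        i = int(np.argmax(sizes))
        if sizes[i] <= best:
            break
        best = sizes[i]
        selected.append(remaining.pop(i))
    return selected
```

In this sketch a gene is kept only if it makes more samples consistently classifiable, which is how a reduction of this kind avoids redundant genes. The threshold delta is fixed here, whereas the paper optimizes its thresholds with NSGA-II.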
