Improving ELM-based microarray data classification by diversified sequence features selection

Abstract In this paper, we focus on the problem of extreme learning machine (ELM)-based microarray data classification. Unlike the traditional classification problem, the goal here is not only to predict the class labels of unseen samples, but also to make clear what leads to the results, i.e., which genes are involved in a specific disease. This is especially significant for biologists, who need to decipher the causes of disease. As a black-box method, ELM cannot accomplish this task by itself. In this work, we propose a framework based on diversified sequence feature selection to address the problem. In this framework, (1) a sequence model, EWave, is introduced to make the structural ordering information among genes exploitable; (2) the concept of an irreducible sequence is proposed, in which the genes work as an ordered whole to maintain high confidence with respect to a specific class, and removing any gene substantially decreases that confidence. An efficient sequence mining algorithm, together with some effective pruning rules, is developed to mine such sequences; and (3) we study how to extract a set of diversified sequence features to represent all mined results. This problem is proved to be NP-hard, and a greedy algorithm is presented to approximate the optimal solution. Experimental results show that the proposed approach significantly improves the efficiency and effectiveness of ELM compared with some widely used feature selection techniques.
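The greedy step in (3) can be illustrated with a minimal sketch. The abstract does not state the objective function, so everything below is an assumption made for illustration: each mined sequence is taken to carry a confidence score, redundancy between two sequences is measured by the Jaccard similarity of their gene sets, and the hypothetical `penalty` parameter trades confidence against overlap. The names `jaccard` and `greedy_diversified` are likewise hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

GeneSequence = Tuple[str, ...]


def jaccard(a: GeneSequence, b: GeneSequence) -> float:
    """Jaccard similarity of the gene sets of two sequences (order ignored)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def greedy_diversified(
    candidates: List[Tuple[GeneSequence, float]],
    k: int,
    penalty: float = 0.5,
) -> List[GeneSequence]:
    """Pick up to k sequences, one at a time.

    Assumed gain (not the paper's objective): confidence minus
    `penalty` times the largest overlap with anything already selected.
    """
    selected: List[GeneSequence] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def gain(item: Tuple[GeneSequence, float]) -> float:
            seq, confidence = item
            overlap = max((jaccard(seq, s) for s in selected), default=0.0)
            return confidence - penalty * overlap

        best = max(remaining, key=gain)
        selected.append(best[0])
        remaining.remove(best)
    return selected


# Toy input: two near-duplicate high-confidence sequences and one distinct one.
mined = [
    (("g1", "g2", "g3"), 0.95),
    (("g1", "g2", "g4"), 0.94),
    (("g7", "g8"), 0.80),
]
chosen = greedy_diversified(mined, k=2)
```

With this toy input, the second pick favors the non-overlapping sequence over the near-duplicate even though its confidence is lower, which is the kind of behavior a diversified feature set is meant to have.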
