A novel PSO-based graph-theoretic approach for identifying most relevant and non-redundant gene markers from gene expression data

Cancer is an extremely complex and heterogeneous genetic disease driven by mutations. Researchers in molecular genetics have predicted a number of key genes that probably contribute to oncogenesis and may serve as potential drug targets for different types of cancer. However, this is still an ongoing process. This article considers not only the relevance of each gene but also the redundancy among genes. A graph-theoretic approach is presented for identifying non-redundant gene markers from microarray gene expression data. The sample-by-gene matrix of the microarray data is first converted into a weighted, undirected, complete feature-graph in which the nodes represent genes, each node is weighted by its gene's relevance, and each edge is weighted by the similarity (correlation) between the two genes it joins. Then, the densest subgraph having minimum average edge weight (similarity) and maximum average node weight (relevance) is identified from the original feature-graph. To find this subgraph, binary particle swarm optimisation is applied to minimise the average edge weight and maximise the average node weight through a single objective function. The result is an optimised, reduced subgraph containing a set of selected genes whose average correlation is very low and whose average relevance is very high. The proposed method is compared with sequential forward search, T-test, Rank-sum test, the minimum redundancy maximum relevance scheme, correlation-based feature selection, sequential backward elimination and the fast correlation-based filter solution in terms of sensitivity, specificity, accuracy, F-score, area under the receiver operating characteristic curve, average correlation and stability on several real-life data-sets.
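The pipeline described above can be illustrated with a minimal sketch. The abstract does not give the relevance measure, the exact form of the single objective function, or the PSO settings, so everything specific here is an assumption made for illustration: relevance is taken as the absolute Pearson correlation between a gene and a binary class label, the objective is the average node weight minus the average edge weight, and the swarm uses the standard sigmoid-based binary PSO update with hypothetical parameter values. The function names `build_feature_graph`, `fitness` and `binary_pso` are likewise invented for this sketch and are not the authors' implementation.

```python
import numpy as np


def build_feature_graph(X, y):
    """Build node and edge weights from a samples-by-genes matrix X and 0/1 labels y.

    Assumed measures: node weight = |Pearson correlation| between gene and label,
    edge weight = |Pearson correlation| between two genes.
    Assumes no gene is constant across samples (otherwise correlations are undefined).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    node_w = np.abs(Xc.T @ yc) / denom
    edge_w = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(edge_w, 0.0)  # no self-loops
    return node_w, edge_w


def fitness(mask, node_w, edge_w):
    """Assumed single objective: average node weight minus average edge weight
    of the subgraph induced by the selected genes (higher is better)."""
    idx = np.flatnonzero(mask)
    k = idx.size
    if k < 2:  # a subgraph needs at least one edge
        return -np.inf
    avg_node = node_w[idx].mean()
    # Symmetric matrix with zero diagonal: sum covers k*(k-1) ordered pairs.
    avg_edge = edge_w[np.ix_(idx, idx)].sum() / (k * (k - 1))
    return avg_node - avg_edge


def binary_pso(node_w, edge_w, n_particles=30, n_iter=100,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard binary PSO: real-valued velocities, sigmoid transfer to bit probabilities.
    All parameter defaults are illustrative, not taken from the paper."""
    rng = np.random.default_rng(seed)
    n = node_w.size
    pos = (rng.random((n_particles, n)) < 0.5).astype(float)
    vel = rng.uniform(-1.0, 1.0, (n_particles, n))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, node_w, edge_w) for p in pos])
    g = int(pbest_fit.argmax())
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    for _ in range(n_iter):
        r1 = rng.random((n_particles, n))
        r2 = rng.random((n_particles, n))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -6.0, 6.0)  # keep the sigmoid away from saturation
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random((n_particles, n)) < prob).astype(float)

        fit = np.array([fitness(p, node_w, edge_w) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(pbest_fit.argmax())
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    return gbest.astype(bool), gbest_fit
```

Under these assumptions, a call such as `mask, score = binary_pso(*build_feature_graph(X, y))` returns a boolean mask over the genes and the objective value of the selected subgraph. Because the sketch scores a subset by a simple difference of averages, it does not control the subset size; a size penalty or a weighting between the two terms would be a separate design choice that the abstract does not specify.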
