Local-Nearest-Neighbors-Based Feature Weighting for Gene Selection

Selecting functional genes is essential for analyzing microarray data. Among many available feature (gene) selection approaches, the ones on the basis of the large margin nearest neighbor receive more attention due to their low computational costs and high accuracies in analyzing the high-dimensional data. Yet, there still exist some problems that hamper the existing approaches in sifting real target genes, including selecting erroneous nearest neighbors, high sensitivity to irrelevant genes, and inappropriate evaluation criteria. Previous pioneer works have partly addressed some of the problems, but none of them are capable of solving these problems simultaneously. In this paper, we propose a new local-nearest-neighbors-based feature weighting approach to alleviate the above problems. The proposed approach is based on the trick of locally minimizing the within-class distances and maximizing the between-class distances with the <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="an-ieq1-2712775.gif"/></alternatives></inline-formula> nearest neighbors rule. We further define a feature weight vector, and construct it by minimizing the cost function with a regularization term. The proposed approach can be applied naturally to the multi-class problems and does not require extra modification. Experimental results on the UCI and the open microarray data sets validate the effectiveness and efficiency of the new approach.

[1]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[2]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[3]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[4]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Michael K. Ng,et al.  Feature weight estimation for gene selection: a local hyperlinear learning approach , 2014, BMC Bioinformatics.

[6]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[7]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[8]  Naftali Tishby,et al.  Margin based feature selection - theory and algorithms , 2004, ICML.

[9]  G. Bontempi,et al.  A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Jin-Mao Wei,et al.  Ensemble Rough Hypercuboid Approach for Classifying Cancers , 2010, IEEE Transactions on Knowledge and Data Engineering.

[11]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[12]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[13]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[14]  Yijun Sun,et al.  Iterative RELIEF for Feature Weighting: Algorithms, Theories, and Applications , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Jing Zhang,et al.  Gene Selection Integrated with Biological Knowledge for Plant Stress Response Using Neighborhood System and Rough Set Theory , 2015, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[16]  Haiyan Wang,et al.  Improving accuracy for cancer classification with a new algorithm for genes selection , 2012, BMC Bioinformatics.

[17]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[18]  Yun Li,et al.  Feature selection based on loss-margin of nearest neighbor classification , 2009, Pattern Recognit..

[19]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[20]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[21]  Masashi Sugiyama,et al.  Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification , 2011, NIPS.

[22]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Koby Crammer,et al.  Margin Analysis of the LVQ Algorithm , 2002, NIPS.

[24]  Josef Kittler,et al.  Pattern recognition : a statistical approach , 1982 .

[25]  Sinisa Todorovic,et al.  Local-Learning-Based Feature Selection for High-Dimensional Data Analysis , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  J. Welsh,et al.  Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. , 2001, Cancer research.

[27]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[28]  Majid Komeili,et al.  Local Feature Selection for Data Classification , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[30]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[31]  Chih-Jen Lin,et al.  Combining SVMs with Various Feature Selection Strategies , 2006, Feature Extraction.

[32]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[33]  E. Lander,et al.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[34]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[35]  Christian A. Rees,et al.  Molecular portraits of human breast tumours , 2000, Nature.

[36]  A. Ng Feature selection, L1 vs. L2 regularization, and rotational invariance , 2004, Twenty-first international conference on Machine learning - ICML '04.

[37]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[38]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.