CAFÉ-Map: Context Aware Feature Mapping for mining high dimensional biomedical data

Feature selection and ranking are of great importance in the analysis of biomedical data. In addition to reducing the number of features used in classification or other machine learning tasks, they allow us to extract meaningful biological and medical information from a machine learning model. Most existing approaches in this domain do not directly model the fact that the relative importance of features can differ across regions of the feature space. In this work, we present a context-aware feature ranking algorithm called CAFÉ-Map. CAFÉ-Map is a locally linear feature ranking framework that identifies the important features in any given region of the feature space or for any individual example. This allows simultaneous classification and feature ranking in an interpretable manner. We have benchmarked CAFÉ-Map on a number of toy and real-world biomedical data sets. Our comparative study with a number of published methods shows that CAFÉ-Map achieves better accuracies on these data sets. The top-ranking features obtained through CAFÉ-Map in a gene profiling study correlate very well with the importance of different genes reported in the literature. Furthermore, CAFÉ-Map provides a more in-depth analysis of feature ranking at the level of individual examples.

AVAILABILITY: CAFÉ-Map Python code is available at http://faculty.pieas.edu.pk/fayyaz/software.html#cafemap. The CAFÉ-Map package supports parallelization and sparse data and provides example scripts for classification. This code can be used to reproduce the results given in this paper.
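To make the idea of locally linear, per-example feature ranking concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the CAFÉ-Map implementation and does not use the CAFÉ-Map package. It assumes a generic locally linear model in which each example is softly assigned to a few anchor points, each anchor carries its own linear weight vector, and the weight vector for an example is the assignment-weighted mix of the anchor weights. All names here (`local_coding`, `local_weights`, `rank_features`, `anchors`, `W`, `sigma`) are hypothetical, and the anchor weights are set by hand rather than learned.

```python
import numpy as np

def local_coding(X, anchors, sigma=1.0):
    """Soft assignment of each example to each anchor; every row sums to 1."""
    # Squared Euclidean distance between every example and every anchor.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)

def local_weights(X, anchors, W, sigma=1.0):
    """Per-example weight vectors: w(x) = sum_k gamma_k(x) * W[k]."""
    return local_coding(X, anchors, sigma) @ W

def decision_scores(X, anchors, W, sigma=1.0):
    """Locally linear score f(x) = w(x) . x (no bias term in this sketch)."""
    return (local_weights(X, anchors, W, sigma) * X).sum(axis=1)

def rank_features(X, anchors, W, sigma=1.0):
    """Feature indices per example, most important first, ranked by |w(x)_j|."""
    Wx = local_weights(X, anchors, W, sigma)
    return np.argsort(-np.abs(Wx), axis=1)

# Toy setup (assumed, hand-picked): two anchors in a 2-D feature space.
# Near the first anchor only feature 0 matters; near the second, only feature 1.
anchors = np.array([[-3.0, 0.0], [3.0, 0.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[-3.0, 0.5], [3.0, 0.5]])

ranks = rank_features(X, anchors, W)
scores = decision_scores(X, anchors, W)
print(ranks)
print(scores)
```

In this toy setup the two examples lie in different regions of the feature space, so they receive different feature rankings even though a single model scores both. In a real setting the anchor points and their weight vectors would be learned from data rather than fixed by hand.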
