Simultaneous classification and relevant feature identification in high-dimensional spaces: application to molecular profiling data

Molecular profiling technologies monitor many thousands of transcripts, proteins, metabolites or other species concurrently in a biological sample of interest. Given such high-dimensional data for different types of samples, classification methods aim to assign specimens to known categories. Relevant feature identification methods seek to define a subset of molecules that differentiate the samples. This work describes LIKNON, a specific implementation of a statistical approach for creating a classifier and identifying a small number of relevant features simultaneously. Given two-class data, LIKNON estimates a sparse linear classifier by exploiting the simple and well-known property that minimising an L1 norm (via linear programming) yields a sparse hyperplane. It performs well in retrospective analysis of three cancer biology profiling data sets: (i) small, round, blue cell tumour transcript profiles from tumour biopsies and cell lines; (ii) sporadic breast carcinoma transcript profiles from patients who developed distant metastases within 5 years and those who remained free of distant metastases for at least 5 years; and (iii) serum protein profiles from unaffected individuals and ovarian cancer patients. Computationally, LIKNON is less demanding than the prevailing filter-wrapper strategy, which generates many feature subsets and equates relevant features with the subset yielding the classifier with the lowest generalisation error. Biologically, the results suggest a role for the cellular microenvironment in influencing disease outcome and point to its importance in developing clinical decision support systems.
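
The property that minimising an L1 norm via linear programming yields a sparse hyperplane can be illustrated with a small sketch. The code below is not the LIKNON implementation; it is a generic L1-norm, soft-margin linear classifier posed as a linear programme and solved with SciPy's `linprog`. The function names `fit_l1_linear_classifier` and `make_toy_data`, the trade-off parameter `C`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def fit_l1_linear_classifier(X, y, C=1.0):
    """Sketch: minimise ||w||_1 + C * sum(xi)
    subject to y_i * (w . x_i + b) >= 1 - xi_i and xi_i >= 0.

    The absolute values are removed by writing w = u - v with u, v >= 0,
    so the problem becomes a linear programme in [u, v, b, xi].
    """
    n, d = X.shape
    yX = y[:, None] * X
    # Objective: sum(u) + sum(v) + 0 * b + C * sum(xi)
    c = np.concatenate([np.ones(2 * d), [0.0], C * np.ones(n)])
    # Margin constraints rewritten as A_ub @ z <= b_ub:
    #   -y_i * x_i . (u - v) - y_i * b - xi_i <= -1
    A_ub = np.hstack([-yX, yX, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    # u, v, xi are non-negative; the offset b is unbounded
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[:d] - res.x[d:2 * d]
    b = res.x[2 * d]
    return w, b


def make_toy_data(seed=0, n=20, d=50):
    """Hypothetical two-class data: only feature 0 carries class signal."""
    rng = np.random.default_rng(seed)
    y = np.repeat([-1.0, 1.0], n // 2)
    X = rng.normal(size=(n, d))
    X[:, 0] = 2.0 * y + 0.1 * rng.normal(size=n)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = fit_l1_linear_classifier(X, y, C=1.0)
    relevant = np.flatnonzero(np.abs(w) > 1e-6)
    print("features with non-zero weight:", relevant)
    print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

In this sketch the features with non-zero weight play the role of the relevant features, so classification and feature identification come out of a single optimisation rather than a search over many candidate feature subsets.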
