A genetic algorithm for optimizing subnetwork markers for the study of breast cancer metastasis

The combined use of gene expression profiles and protein-protein interaction networks has shown remarkable success in predicting breast cancer metastasis. Nevertheless, effectively identifying predictive subnetwork markers, the first step of any network-based method, remains a major challenge. Existing methods typically use greedy search algorithms to find subnetworks. This strategy, though computationally efficient, may fail to find the optimal subnetwork markers and thereby impair the performance of the classifiers trained on them. In this paper, we propose a genetic algorithm to improve the subnetwork markers identified by an existing greedy search method. We demonstrate that the discriminative power of the optimized subnetwork markers is significantly higher than that of the original subnetwork markers, and we show that higher classification performance is achieved when the optimized subnetworks are used as predictive features in six popular machine learning approaches (logistic regression, support vector machine, decision tree, AdaBoost, random forest and LogitBoost). Among these classification approaches, LogitBoost with the optimized subnetwork markers shows the highest classification performance and the best reproducibility for identifying breast cancer metastasis.
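The refinement idea can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the abstract does not specify the encoding, fitness function, or genetic operators, so everything below is an assumption made for illustration. In particular, a subnetwork is represented as a set of gene names, its activity is the mean expression of its genes, the fitness is the absolute Welch t-statistic between metastatic and non-metastatic samples, and the names `activity`, `fitness`, `mutate`, `crossover`, `optimize`, and the toy data are all hypothetical.

```python
import random
from statistics import mean, stdev

def activity(subnet, expr):
    """Per-sample activity: mean expression over the subnetwork's genes (assumed scoring)."""
    genes = sorted(subnet)
    return [mean(expr[g][i] for g in genes) for i in range(len(expr[genes[0]]))]

def fitness(subnet, expr, labels):
    """Absolute Welch t-statistic of activity between the two classes (assumed fitness)."""
    act = activity(subnet, expr)
    pos = [a for a, y in zip(act, labels) if y == 1]
    neg = [a for a, y in zip(act, labels) if y == 0]
    se = (stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)) ** 0.5
    return abs(mean(pos) - mean(neg)) / se if se > 0 else 0.0

def is_connected(subnet, ppi):
    """Depth-first search restricted to the subnetwork's own genes."""
    start = next(iter(subnet))
    seen, stack = {start}, [start]
    while stack:
        for nb in ppi[stack.pop()]:
            if nb in subnet and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == set(subnet)

def mutate(subnet, ppi, rng):
    """Add a neighbouring gene or drop a member; reject drops that disconnect."""
    child = set(subnet)
    frontier = sorted({nb for g in child for nb in ppi[g]} - child)
    if frontier and (len(child) == 1 or rng.random() < 0.5):
        child.add(rng.choice(frontier))
    elif len(child) > 1:
        child.remove(rng.choice(sorted(child)))
        if not is_connected(child, ppi):
            return set(subnet)
    return child

def crossover(a, b, ppi, rng):
    """Keep genes shared by both parents; keep each remaining gene with probability 0.5."""
    child = {g for g in sorted(a | b) if (g in a and g in b) or rng.random() < 0.5}
    return child if child and is_connected(child, ppi) else set(a)

def optimize(seed, ppi, expr, labels, pop_size=20, generations=30, rng_seed=0):
    """Refine a seed subnetwork (e.g. one found by greedy search) with an elitist GA."""
    rng = random.Random(rng_seed)
    score = lambda s: fitness(s, expr, labels)
    population = [set(seed)] + [mutate(seed, ppi, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, ppi, rng), ppi, rng))
        population = parents + children
    return max(population, key=score)

# Toy data (invented for illustration): 4 genes, 6 samples, label 1 = metastatic.
ppi = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "A": [2.0, 1.8, 2.2, 0.1, -0.2, 0.0],
    "B": [0.3, -0.5, 0.1, 0.2, -0.1, 0.4],
    "C": [1.9, 2.1, 1.7, 0.0, 0.3, -0.1],
    "D": [-0.2, 0.4, 0.0, 0.1, -0.3, 0.2],
}
seed = {"A", "B"}  # stands in for a greedy-search result
best = optimize(seed, ppi, expr, labels)
print(sorted(best), fitness(best, expr, labels), fitness(seed, expr, labels))
```

Because the seed is part of the initial population and the top half survives every generation, the returned subnetwork never scores lower than the seed under this fitness. A real application would evaluate the refined markers on held-out data, since optimizing a discriminative score on the training samples alone can overfit.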
