Learning on Weighted Hypergraphs to Integrate Protein Interactions and Gene Expressions for Cancer Outcome Prediction

Building reliable predictive models from multiple complementary genomic data for cancer study is a crucial step towards successful cancer treatment and a full understanding of the underlying biological principles. To tackle this challenging data integration problem, we propose a hypergraph-based learning algorithm called HyperGene to integrate microarray gene expressions and protein-protein interactions for cancer outcome prediction and biomarker identification. HyperGene is a robust two-step iterative method that alternatively finds the optimal outcome prediction and the optimal weighting of the marker genes guided by a protein-protein interaction network. Under the hypothesis that cancer-related genes tend to interact with each other, the HyperGene algorithm uses a protein-protein interaction network as prior knowledge by imposing a consistent weighting of interacting genes. Our experimental results on two large-scale breast cancer gene expression datasets show that HyperGene utilizing a curated protein-protein interaction network achieves significantly improved cancer outcome prediction. Moreover, HyperGene can also retrieve many known cancer genes as highly weighted marker genes.

[1]  J. Foekens,et al.  Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer , 2005, The Lancet.

[2]  T. Ideker,et al.  Network-based classification of breast cancer metastasis , 2007, Molecular systems biology.

[3]  D. Hanahan,et al.  The Hallmarks of Cancer , 2000, Cell.

[4]  Serge J. Belongie,et al.  Higher order learning with graphs , 2006, ICML.

[5]  Bernhard Schölkopf,et al.  Learning with Hypergraphs: Clustering, Classification, and Embedding , 2006, NIPS.

[6]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[7]  Pietro Perona,et al.  Beyond pairwise clustering , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[8]  Hui Xiong,et al.  Mining strong affinity association patterns in data sets with skewed support distribution , 2003, Third IEEE International Conference on Data Mining.

[9]  Zhifu Sun,et al.  Gene Expression Profiling on Lung Cancer Outcome Prediction: Present Clinical Value and Future Premise , 2006, Cancer Epidemiology Biomarkers & Prevention.

[10]  M. Oti,et al.  The modular nature of genetic diseases , 2006, Clinical genetics.

[11]  Van,et al.  A gene-expression signature as a predictor of survival in breast cancer. , 2002, The New England journal of medicine.

[12]  M. DePamphilis,et al.  HUMAN DISEASE , 1957, The Ulster Medical Journal.

[13]  Hui Xiong,et al.  Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery , 2004, Pacific Symposium on Biocomputing.

[14]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[15]  Yudong D. He,et al.  A Gene-Expression Signature as a Predictor of Survival in Breast Cancer , 2002 .

[16]  M. J. van de Vijver,et al.  Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. , 2006, Journal of the National Cancer Institute.

[17]  Emmanuel Barillot,et al.  Classification of microarray data using gene networks , 2007, BMC Bioinformatics.

[18]  W. Gerald,et al.  Gene expression profiling predicts clinical outcome of prostate cancer. , 2004, The Journal of clinical investigation.

[19]  A. Mendelsohn,et al.  Protein Interaction Methods-Toward an Endgame , 1999, Science.

[20]  A. Dupuy,et al.  Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. , 2007, Journal of the National Cancer Institute.

[21]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[22]  Quanquan Gu,et al.  Learning the Shared Subspace for Multi-task Clustering and Transductive Transfer Classification , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[23]  A. Barabasi,et al.  The human disease network , 2007, Proceedings of the National Academy of Sciences.

[24]  Koji Tsuda,et al.  Propagating distributions on a hypergraph by dual information regularization , 2005, ICML.

[25]  Baldomero Oliva,et al.  Predicting cancer involvement of genes from heterogeneous data , 2008, BMC Bioinformatics.