A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge

MOTIVATION: Incorporating biological prior knowledge into predictive models is a challenging data integration problem in analyzing high-dimensional genomic data. We introduce a hypergraph-based semi-supervised learning algorithm called HyperPrior to classify gene expression and array-based comparative genomic hybridization (arrayCGH) data using biological knowledge as constraints on graph-based learning. HyperPrior is a robust two-step iterative method that alternately finds the optimal labeling of the samples and the optimal weighting of the features, guided by constraints encoding prior knowledge. The prior knowledge for analyzing gene expression data is that cancer-related genes tend to interact with each other in a protein-protein interaction (PPI) network. Similarly, the prior knowledge for analyzing arrayCGH data is that probes that are spatially nearby in their layout along the chromosomes tend to be involved in the same amplification or deletion event. Based on the prior knowledge, HyperPrior imposes a consistent weighting of the correlated genomic features in graph-based learning.

RESULTS: We tested HyperPrior on two arrayCGH datasets and two gene expression datasets for both cancer classification and biomarker identification. On all the datasets, HyperPrior achieved competitive classification performance compared with SVMs and other baselines that use the same prior knowledge. HyperPrior also identified several discriminative regions on chromosomes and discriminative subnetworks in the PPI network, both of which contain cancer-related genomic elements. Our results suggest that HyperPrior is promising in utilizing biological prior knowledge to achieve better classification performance and more biologically interpretable findings in gene expression and arrayCGH data.

AVAILABILITY: http://compbio.cs.umn.edu/HyperPrior

CONTACT: kuang@cs.umn.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
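The two-step alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the objective, so the sketch assumes an unnormalized hypergraph Laplacian for the labeling step and, for the weighting step, a plain gradient step followed by clipping and rescaling (a heuristic stand-in for a proper constrained solver). The function names, the parameters `mu`, `rho` and `eta`, and the toy data are all hypothetical and chosen only for illustration. Samples are hypergraph vertices, each hyperedge groups samples that share a feature pattern, and `A` is a feature network (for example PPI links or chromosomal adjacency) that encourages linked features to receive similar weights.

```python
import numpy as np

def propagate_labels(H, w, y, mu=1.0):
    """Step 1: with hyperedge weights w fixed, solve for sample scores f."""
    de = H.sum(axis=0)                      # hyperedge sizes
    dv = H @ w                              # weighted vertex degrees
    L = np.diag(dv) - (H * (w / de)) @ H.T  # unnormalized hypergraph Laplacian
    n = H.shape[0]
    # Minimizes f^T L f + mu * ||f - y||^2, which has a closed-form solution.
    return np.linalg.solve(L + mu * np.eye(n), mu * y)

def update_weights(H, f, w, A, rho=0.1, eta=0.5, floor=1e-6):
    """Step 2: with f fixed, take one gradient step on the hyperedge weights."""
    de = H.sum(axis=0)
    # Per-hyperedge label inconsistency; f^T L f equals the w-weighted sum of these.
    cost = H.T @ f**2 - (H.T @ f) ** 2 / de
    L_A = np.diag(A.sum(axis=1)) - A        # Laplacian of the feature network
    grad = cost + 2.0 * rho * (L_A @ w)     # gradient of cost.w + rho * w^T L_A w
    w_new = np.clip(w - eta * grad, floor, None)
    return w_new * (len(w) / w_new.sum())   # keep the total weight constant

def alternate_labels_and_weights(H, y, A, n_iter=20):
    """Alternate the two steps for a fixed number of iterations."""
    w = np.ones(H.shape[1])
    for _ in range(n_iter):
        f = propagate_labels(H, w, y)
        w = update_weights(H, f, w, A)
    return propagate_labels(H, w, y), w

# Toy data (made up): 6 samples, 4 hyperedges.
# Hyperedges 0 and 1 each cover one class; hyperedges 2 and 3 mix the classes.
H = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)
y = np.array([1.0, 1.0, 0.0, -1.0, -1.0, 0.0])   # 0 marks an unlabeled sample
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)         # feature-network adjacency

f, w = alternate_labels_and_weights(H, y, A)
print("sample scores:", np.round(f, 3))
print("hyperedge weights:", np.round(w, 3))
```

In this toy setting the sign of each entry of `f` gives the predicted class of the corresponding sample, and hyperedges that mix the two classes end up with lower weight than hyperedges that agree with the labels. The smoothness term `rho * w^T L_A w` is one simple way to express "consistent weighting of correlated features"; the actual constraints used by HyperPrior may differ.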
