Smoothing Gene Expression Using Biological Networks

Gene expression (micro array) data have been used widely in bioinformatics. The expression data of a large number of genes from small numbers of subjects are used to identify informative biomarkers that may predict or help in diagnosing some disorders. More recently, increasing amounts of information from underlying relationships of the expressed genes have become available, and workers have started to investigate algorithms which can use such a priori information to improve classification or regression based on gene expression. In this paper, we describe three novel machine learning algorithms for regularizing (smoothing) micro array expression values defined on gene sets with known prior network or metric structures, and which exploit this gene interaction information. These regularized expression values can be used with any machine classifier with the goal of better classification. In this paper, standard smoothing (denoising) techniques previously developed for functions on Euclidean spaces are extended to allow smoothing of micro array expression feature vectors using distance measures defined by biological networks. Such a priori smoothing (denoising) of the feature vectors using metrics on the index space (here the space of genes) yields better signal to noise ratios in the data. When tested on two breast cancer datasets, support vector machine classifiers trained on the smoothed expression values obtain better areas under ROC curves in two cancer datasets.

[1]  Jeffrey T. Chang,et al.  Oncogenic pathway signatures in human cancers as a guide to targeted therapies , 2006, Nature.

[2]  S. Horvath,et al.  A General Framework for Weighted Gene Co-Expression Network Analysis , 2005, Statistical applications in genetics and molecular biology.

[3]  Natalia Shulzhenko,et al.  Microarrays for cancer diagnosis and classification. , 2007, Advances in experimental medicine and biology.

[4]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Trey Ideker,et al.  Testing for Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray Data , 2000, J. Comput. Biol..

[6]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[7]  W. Härdle,et al.  Applied Nonparametric Regression , 1991 .

[8]  Shu Yang,et al.  Target Detection Via Network Filtering , 2009, IEEE Transactions on Information Theory.

[9]  Doheon Lee,et al.  Inferring Pathway Activity toward Precise Disease Classification , 2008, PLoS Comput. Biol..

[10]  John D. Lafferty,et al.  Diffusion Kernels on Graphs and Other Discrete Input Spaces , 2002, ICML.

[11]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[12]  Ian M. Donaldson,et al.  iRefIndex: A consolidated protein interaction database with provenance , 2008, BMC Bioinformatics.

[13]  Qing Wang,et al.  Towards precise classification of cancers based on robust gene functional expression profiles , 2005, BMC Bioinformatics.

[14]  David L. Donoho,et al.  De-noising by soft-thresholding , 1995, IEEE Trans. Inf. Theory.

[15]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[16]  S DhillonInderjit,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007 .

[17]  J. Foekens,et al.  Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer , 2005, The Lancet.

[18]  Mark Kon,et al.  Regularization Techniques for Machine Learning on Graphs and Networks with Biological Applications , 2010 .

[19]  Peter Sarnak,et al.  Linking Numbers of Modular Knots , 2010 .

[20]  T. Ideker,et al.  Network-based classification of breast cancer metastasis , 2007, Molecular systems biology.

[21]  Yudong D. He,et al.  A Gene-Expression Signature as a Predictor of Survival in Breast Cancer , 2002 .

[22]  E. Dougherty,et al.  Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity , 2009, PloS one.

[23]  Lincoln Stein,et al.  Reactome knowledgebase of human biological pathways and processes , 2008, Nucleic Acids Res..

[24]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Emmanuel Barillot,et al.  Classification of microarray data using gene networks , 2007, BMC Bioinformatics.

[26]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[27]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[28]  Hiroyuki Ogata,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 1999, Nucleic Acids Res..

[29]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[30]  W. Härdle Applied Nonparametric Regression , 1992 .

[31]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[32]  Alexander E. Kel,et al.  TRANSFAC®: transcriptional regulation, from patterns to profiles , 2003, Nucleic Acids Res..

[33]  Van,et al.  A gene-expression signature as a predictor of survival in breast cancer. , 2002, The New England journal of medicine.

[34]  Roger E Bumgarner,et al.  Clustering gene-expression data with repeated measurements , 2003, Genome Biology.

[35]  E. Lander,et al.  A molecular signature of metastasis in primary solid tumors , 2003, Nature Genetics.

[36]  Tony F. Chan,et al.  Image processing and analysis - variational, PDE, wavelet, and stochastic methods , 2005 .