Detecting disease genes based on semi-supervised learning and protein-protein interaction networks

OBJECTIVE Predicting or prioritizing the human genes that cause disease, or "disease genes", is one of the emerging tasks in biomedical informatics. Network-based approaches to this problem rest on the key assumption that "the network-neighbour of a disease gene is likely to cause the same or a similar disease", and they mostly apply supervised learning methods to data on well-known disease genes. This work aims to find an effective method that exploits the disease gene neighbourhood and integrates several useful omics data sources, both of which could enhance disease gene prediction.

METHODS We present a novel method that predicts disease genes by exploiting, within a semi-supervised learning (SSL) scheme, data on both disease genes and their neighbours in the protein-protein interaction network. Multiple proteomic and genomic data were integrated from six biological databases (Universal Protein Resource, Interologous Interaction Database, Reactome, Gene Ontology, Pfam, and InterDom) and a gene expression dataset.

RESULTS In 10 repetitions of stratified 10-fold cross-validation, the SSL method outperforms the k-nearest neighbour and support vector machine methods, achieving a sensitivity of 85%, specificity of 79%, precision of 81%, accuracy of 82%, and a balanced F-measure of 83%. Further comparative experiments show that the proposed method retains an advantage when only a small amount of labeled data is available, reaching an accuracy of 78%. Applying the proposed method yielded 572 putative disease genes, which are supported by indirect biological evidence.

CONCLUSION Semi-supervised learning improves the ability to study disease genes, especially for a specific disease, where the known disease genes (the labeled data) are often very limited. Beyond the computational improvement, analysis of the predicted disease proteins indicates that the findings help in deciphering pathogenic mechanisms.
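The abstract does not say which SSL algorithm is used, so the following is only an illustrative sketch of how labels can spread from known disease genes to their network neighbours. It assumes harmonic-function label propagation on a toy interaction graph; the function name `harmonic_label_propagation`, the weight matrix `W`, and the five-node path graph are hypothetical and are not taken from the paper.

```python
import numpy as np


def harmonic_label_propagation(W, labels):
    """Score every node of a graph from a few labeled nodes.

    W      : symmetric (n, n) array of non-negative edge weights
             (assumed here to stand in for a protein-protein interaction network).
    labels : dict mapping node index -> 1.0 (disease) or 0.0 (non-disease).

    Returns an array of n scores in [0, 1]. Assumes every connected
    component of the graph contains at least one labeled node; otherwise
    the linear system below is singular.
    """
    n = W.shape[0]
    labeled = sorted(labels)
    unlabeled = [i for i in range(n) if i not in labels]
    y_l = np.array([labels[i] for i in labeled], dtype=float)

    # Graph Laplacian L = D - W, where D holds the node degrees.
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled, unlabeled)]
    W_ul = W[np.ix_(unlabeled, labeled)]

    # Harmonic solution: each unlabeled score equals the weighted
    # average of its neighbours' scores.
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)

    scores = np.empty(n)
    scores[labeled] = y_l
    scores[unlabeled] = f_u
    return scores


# Toy example: a path graph 0-1-2-3-4 with unit edge weights.
W = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    W[a, b] = W[b, a] = 1.0

# Node 0 is a known disease gene, node 4 a known non-disease gene.
scores = harmonic_label_propagation(W, {0: 1.0, 4: 0.0})
print(scores)
```

On this toy graph the scores fall off with distance from the labeled disease gene, which is the neighbourhood assumption in miniature. A real pipeline would also need the integrated feature data and a decision threshold, neither of which is modelled here.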

[1]  E. Birney,et al.  Reactome: a knowledgebase of biological pathways , 2004, Nucleic Acids Research.

[2]  Eric I. Danek,et al.  Phosphorylation of DCC by Fyn mediates Netrin-1 signaling in growth cone guidance , 2004, The Journal of cell biology.

[3]  A. Barabasi,et al.  The human disease network , 2007, Proceedings of the National Academy of Sciences.

[4]  Jason Weston,et al.  Semi-supervised Protein Classification Using Cluster Kernels , 2003, NIPS.

[5]  L. Anderson,et al.  The ERBB3 receptor in cancer and cancer gene therapy , 2008, Cancer Gene Therapy.

[6]  Maricel G. Kann,et al.  Protein interactions and disease: computational approaches to uncover the etiology of diseases , 2007, Briefings Bioinform..

[7]  P. Slocombe,et al.  Human p59fyn(T) regulates OKT3-induced calcium influx by a mechanism distinct from PIP2 hydrolysis in Jurkat T cells. , 1995, Journal of immunology.

[8]  Igor Jurisica,et al.  Online Predicted Human Interaction Database , 2005, Bioinform..

[9]  M. Oti,et al.  The modular nature of genetic diseases , 2006, Clinical genetics.

[10]  David J. Porteous,et al.  Speeding disease gene discovery by sequence based candidate prioritization , 2005, BMC Bioinformatics.

[11]  Thanh Phuong Nguyen,et al.  A Semi-supervised Learning Approach to Disease Gene Prediction , 2007, 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007).

[12]  C. Ouzounis,et al.  Genome-wide identification of genes likely to be involved in human genetic disease. , 2004, Nucleic acids research.

[13]  Hans-Peter Kriegel,et al.  Graph Kernels For Disease Outcome Prediction From Protein-Protein Interaction Networks , 2006, Pacific Symposium on Biocomputing.

[14]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[15]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[16]  Ting Chen,et al.  Further understanding human disease genes by comparing with housekeeping genes and other genes , 2006, BMC Genomics.

[17]  R. Sharan,et al.  Protein networks in disease. , 2008, Genome research.

[18]  T. Gilliam,et al.  Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[19]  C. Pecquet,et al.  Mining for JAK-STAT mutations in cancer. , 2008, Trends in biochemical sciences.

[20]  N. Hynes,et al.  The ErbB receptors and their role in cancer progression. , 2003, Experimental cell research.

[21]  Michael Q. Zhang,et al.  Network-based global inference of human disease genes , 2008, Molecular systems biology.

[22]  Roded Sharan,et al.  A Network-Based Method for Predicting Disease-Causing Genes , 2009, J. Comput. Biol..

[23]  See-Kiong Ng,et al.  InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes , 2003, Nucleic Acids Res..

[24]  A. Verma,et al.  Jak family of kinases in cancer , 2003, Cancer and Metastasis Reviews.

[25]  T. Jiang,et al.  Modularity in the genetic disease‐phenotype network , 2008, FEBS letters.

[26]  Xue-wen Chen,et al.  Human Disease-Gene Classification with Integrative Sequence-Based and Topological Features of Protein-Protein Interaction Networks , 2007, 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007).

[27]  H. Aburatani,et al.  Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. , 2005, Genomics.

[28]  B. Snel,et al.  Predicting disease genes using protein–protein interactions , 2006, Journal of Medical Genetics.

[29]  A. Eyre-Walker,et al.  Human disease genes: patterns and predictions. , 2003, Gene.

[30]  B Marshall,et al.  Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource , 2004, Nucleic Acids Res..

[31]  Frances S. Turner,et al.  POCUS: mining genomic sequence annotation to predict disease genes , 2003, Genome Biology.

[32]  Ian Witten,et al.  Data Mining , 2000 .

[33]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[34]  S. Batalov,et al.  A gene atlas of the mouse and human protein-encoding transcriptomes. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[35]  A. Bateman,et al.  Protein interactions in human genetic diseases , 2008, Genome Biology.

[36]  Christos A. Ouzounis,et al.  Highly consistent patterns for inherited human diseases at the molecular level , 2006, Bioinform..

[37]  Lin Gao,et al.  International Journal of Biological Sciences , 2011 .

[38]  Tobias Scheffer,et al.  Multi-Relational Learning, Text Mining, and Semi-Supervised Learning for Functional Genomics , 2004, Machine Learning.

[39]  Brad T. Sherman,et al.  DAVID: Database for Annotation, Visualization, and Integrated Discovery , 2003, Genome Biology.

[40]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[41]  Yongjin Li,et al.  Discovering disease-genes by topological features in human protein-protein interaction network , 2006, Bioinform..

[42]  A. Barabasi,et al.  A Protein–Protein Interaction Network for Human Inherited Ataxias and Disorders of Purkinje Cell Degeneration , 2006, Cell.

[43]  Mehmet Koyutürk,et al.  Disease Gene Prioritization Based on Topological Similarity in Protein-Protein Interaction Networks , 2011, RECOMB.

[44]  Mark F. Ciaccio,et al.  Systems-Level Analysis of ErbB4 Signaling in Breast Cancer: A Laboratory to Clinical Perspective , 2008, Molecular Cancer Research.

[45]  Alan R. Powell,et al.  Integration of text- and data-mining using ontologies successfully selects disease gene candidates , 2005, Nucleic acids research.

[46]  E. Dermitzakis From gene expression to disease risk , 2008, Nature Genetics.

[47]  D. Arango,et al.  A gene expression profile that defines colon cell maturation in vitro. , 2002, Cancer research.

[48]  K. D. Sørensen,et al.  Chromosomal deletion, promoter hypermethylation and downregulation of FYN in prostate cancer , 2008, International journal of cancer.

[49]  P. Bork,et al.  Association of genes to genetically inherited diseases using data mining , 2002, Nature Genetics.

[50]  P. Radivojac,et al.  An integrated approach to inferring gene–disease associations in humans , 2008, Proteins.

[51]  Francesco Pinciroli,et al.  GFINDer: genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists , 2005, Nucleic Acids Res..

[52]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2004, Nucleic Acids Res..

[53]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[54]  Carl Kingsford,et al.  The power of protein interaction networks for associating genes with diseases , 2010, Bioinform..

[55]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .