Prediction of Candidate Primary Immunodeficiency Disease Genes Using a Support Vector Machine Learning Approach

Screening for and early identification of primary immunodeficiency disease (PID) genes are major challenges for physicians. Many resources catalogue molecular alterations in known PID genes along with their associated clinical and immunological phenotypes, but they do not help identify candidate PID genes. We recently developed a platform, the Resource of Asian Primary Immunodeficiency Diseases (RAPID), which hosts information on molecular alterations, protein–protein interaction networks, mouse studies and microarray gene expression profiling for all known PID genes. Using this resource as a discovery tool, we describe the development of an algorithm for predicting candidate PID genes. With a support vector machine learning approach trained on 69 binary features of 148 known PID genes and 3162 non-PID genes, we predicted 1442 candidate PID genes. The power of this approach is illustrated by the fact that six of the predicted genes have recently been confirmed experimentally to be PID genes. The remaining predicted genes are attractive candidates for testing in patients whose etiology cannot be ascribed to any known PID gene.
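The general idea of training a support vector machine on binary gene features with far fewer positive than negative examples can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the kernel, the software, or the feature definitions, so the sketch assumes a linear SVM fitted by Pegasos-style sub-gradient descent, uses purely synthetic genes, and all function names and parameter values (`make_toy_genes`, `train_linear_svm`, `pos_weight`, the feature probabilities) are hypothetical. Only the feature count of 69 is taken from the text.

```python
import random

N_FEATURES = 69  # number of binary features reported in the abstract


def make_toy_genes(n, p_on, rng):
    """Generate n synthetic genes; p_on[j] is the chance that feature j is 1."""
    return [[1 if rng.random() < p else 0 for p in p_on] for _ in range(n)]


def decision(w, b, x):
    """Signed score of one gene under a linear model (higher = more PID-like)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, epochs=5, pos_weight=1.0, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) by Pegasos-style sub-gradient steps.

    The bias is learned as the weight of a constant extra feature.
    pos_weight > 1 makes margin violations on the rare positive class cost more.
    """
    rng = random.Random(seed)
    Xa = [list(x) + [1] for x in X]          # append the constant bias feature
    w = [0.0] * len(Xa[0])
    t = 0
    order = list(range(len(Xa)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            score = sum(wj * xj for wj, xj in zip(w, Xa[i]))
            violated = y[i] * score < 1.0    # hinge loss is active
            w = [(1.0 - eta * lam) * wj for wj in w]   # L2 shrinkage
            if violated:
                step = eta * y[i] * (pos_weight if y[i] == 1 else 1.0)
                w = [wj + step * xj for wj, xj in zip(w, Xa[i])]
    return w[:-1], w[-1]


# Synthetic training set: a small positive class and a large negative class.
rng = random.Random(42)
signal = [0.6] * 10 + [0.1] * (N_FEATURES - 10)   # hypothetical "PID-like" profile
background = [0.1] * N_FEATURES                   # hypothetical background profile
X = make_toy_genes(40, signal, rng) + make_toy_genes(400, background, rng)
y = [1] * 40 + [-1] * 400

w, b = train_linear_svm(X, y, pos_weight=10.0)

# Rank unlabeled genes by decision score, most PID-like first.
unlabeled = make_toy_genes(5, signal, rng) + make_toy_genes(5, background, rng)
ranked = sorted(range(len(unlabeled)),
                key=lambda i: decision(w, b, unlabeled[i]),
                reverse=True)
print(ranked)
```

Weighting the positive class is one common way to handle an imbalance such as 148 known PID genes against 3162 non-PID genes; whether the authors used class weights, resampling, or neither is not stated in the abstract.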
