Prioritisation of candidate Single Amino Acid Polymorphisms using one-class learning machines

Recent advances in next-generation sequencing technology have enabled the direct sequencing of rare genetic variants in both case and control individuals. Although a few statistical methods exist for uncovering potential associations between multiple rare variants and human inherited diseases, most of them rely on computational approaches to filter out non-functional variants in order to maximise statistical power. To tackle this problem, we formulate the detection of genetic variants associated with a specific type of disease as a one-class novelty learning problem. We focus on a typical type of genetic variant, Single Amino Acid Polymorphisms (SAAPs), and we take advantage of a feature selection mechanism and two one-class learning methods to prioritise candidate SAAPs. Systematic validation demonstrates that the proposed model is effective in recovering disease-associated SAAPs.
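
The one-class formulation can be illustrated with a minimal sketch. The abstract does not name the two one-class learners or the features used, so everything below is an assumption made for illustration: a Gaussian kernel (Parzen-window) density score stands in for a one-class learner, the two-dimensional feature vectors and SAAP identifiers are hypothetical, and the feature selection step is omitted. The model is fitted only on known disease-associated examples, and candidates are ranked by how closely they resemble that class.

```python
import math


def parzen_score(x, train, sigma=1.0):
    """Mean Gaussian kernel similarity between x and the training vectors.

    `train` holds feature vectors of known disease-associated variants only
    (the single class); a higher score means x looks more like that class.
    """
    total = 0.0
    for t in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(train)


def prioritise(candidates, train, sigma=1.0):
    """Return candidate names ordered from highest to lowest score."""
    scored = [(parzen_score(vec, train, sigma), name)
              for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical feature vectors; real features are not given in the abstract.
known_disease = [(0.9, 0.8), (1.0, 0.7), (0.8, 0.9)]
candidates = {
    "saap_A": (0.85, 0.8),
    "saap_B": (0.1, 0.2),
    "saap_C": (0.6, 0.6),
}

ranking = prioritise(candidates, known_disease, sigma=0.3)
print(ranking)
```

In this sketch the kernel width `sigma` controls how quickly the score decays with distance from the training examples; in practice it would need to be tuned, for example by cross-validation on held-out disease-associated variants.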
