A computational method for identification of disease-associated non-coding SNPs in human genome

Accurately distinguishing functionally relevant variants from the ubiquitous background of genetic variation is a significant challenge in bioinformatics, and it is more severe for non-coding variants. This study presents a novel computational method for identifying candidate disease-associated non-coding single nucleotide polymorphisms (SNPs) in the human genome. To characterize SNPs, an extensive range of features is extracted, including sequence context, DNA structure, evolutionary conservation, and histone modification signals. A random forest is then used to build the classifier, together with an ensemble method to deal with unbalanced data. Under 10-fold cross-validation, the proposed method achieves an area under the ROC curve (AUC) of 0.74. All the original data and the MATLAB source code are available at https://sourceforge.net/projects/dissnp-predict/.
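
The abstract does not spell out how the ensemble handles class imbalance or how the AUC is computed, so the following is only an illustrative sketch in Python (not the authors' MATLAB code). It assumes one common scheme: the majority class is split into chunks of roughly minority-class size, each chunk is paired with all minority samples to train one base classifier (for example a random forest), and the per-model scores are averaged. The function names `balanced_subsets`, `average_scores`, and `auc` are hypothetical, and the base classifier itself is left out.

```python
import random


def balanced_subsets(minority, majority, seed=0):
    """Pair all minority samples with disjoint chunks of the majority class.

    Assumption for illustration: each chunk is roughly the size of the
    minority class, so every base classifier sees balanced training data.
    """
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    n_chunks = max(1, len(shuffled) // len(minority))
    # Striding by n_chunks gives disjoint chunks that cover the majority class.
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    return [(list(minority), chunk) for chunk in chunks]


def average_scores(per_model_scores):
    """Combine base classifiers by averaging their scores for each sample."""
    return [sum(column) / len(column) for column in zip(*per_model_scores)]


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half).

    Equals the probability that a random positive (label 1) is scored
    higher than a random negative (label 0).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a cross-validation setting, `balanced_subsets` would be applied to the training folds only, and `auc` to the averaged scores on the held-out fold; the pairwise AUC is quadratic in the number of samples, which is fine for a sketch but not for large data sets.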
