Data mining of public SNP databases for the selection of intragenic SNPs

Different strategies to search public single nucleotide polymorphism (SNP) databases for intragenic SNPs were evaluated. First, we assembled a strategy to annotate SNPs onto candidate genes based on a BLAST search of public SNP databases (Intragenic SNP Annotation by BLAST, ISAB). Only BLAST hits that met stringent criteria for 1) percentage identity (minimum 98%), 2) BLAST hit length (the hit covers at least 98% of the length of the SNP entry in the database, or the hit is longer than 250 base pairs), and 3) location in non‐repetitive DNA were considered valid SNPs. We assessed the intragenic context and redundancy of these SNPs, and demonstrated that the SNP contents of the dbSNP and HGBASE/HGVbase databases are highly complementary but also overlap significantly. Second, we assessed the validity of the intragenic SNP annotation available on the dbSNP and HGVbase websites by comparing it with the results of the ISAB strategy. Only a minority of all annotated SNPs were common to the respective public SNP database websites and the ISAB annotation strategy. A detailed analysis was performed to explain this discrepancy. In conclusion, we recommend the application of an independent strategy (such as ISAB) to annotate intragenic SNPs, complementary to the annotation provided at the dbSNP and HGVbase websites. Such an approach might be useful in the selection of intragenic SNPs for genotyping in genetic studies. Hum Mutat 20:162–173, 2002. © 2002 Wiley‐Liss, Inc.
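
The three hit-acceptance criteria stated in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `BlastHit` record, its field names, and the `is_valid_hit` function are hypothetical, and the sketch assumes that identity, hit length, SNP entry length, and repeat status have already been extracted from the BLAST output and a repeat annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical record for one BLAST hit of a SNP entry on a gene sequence."""
    snp_id: str
    percent_identity: float   # identity of the aligned region, in percent
    hit_length: int           # length of the BLAST hit, in base pairs
    snp_entry_length: int     # length of the SNP entry in the database, in base pairs
    in_repeat: bool           # True if the hit lies in repetitive DNA


def is_valid_hit(hit: BlastHit,
                 min_identity: float = 98.0,
                 min_coverage: float = 0.98,
                 long_hit_bp: int = 250) -> bool:
    """Apply the three criteria from the abstract; thresholds are its stated values."""
    # Criterion 1: minimum percentage identity.
    if hit.percent_identity < min_identity:
        return False
    # Criterion 2: the hit covers at least 98% of the SNP entry,
    # or the hit is longer than 250 base pairs.
    covers_entry = hit.hit_length >= min_coverage * hit.snp_entry_length
    is_long_hit = hit.hit_length > long_hit_bp
    if not (covers_entry or is_long_hit):
        return False
    # Criterion 3: the hit must lie in non-repetitive DNA.
    return not hit.in_repeat


# Illustrative hits with made-up identifiers and values.
hits = [
    BlastHit("snp_a", 99.0, 100, 100, False),  # passes all criteria
    BlastHit("snp_b", 97.5, 100, 100, False),  # identity too low
    BlastHit("snp_c", 99.0, 300, 500, False),  # partial coverage, but longer than 250 bp
    BlastHit("snp_d", 99.0, 200, 500, False),  # partial coverage and too short
    BlastHit("snp_e", 99.0, 100, 100, True),   # located in repetitive DNA
]
valid_ids = [h.snp_id for h in hits if is_valid_hit(h)]
```

In this sketch the length criterion is treated as an either/or condition, following the wording of the abstract, while the identity and repeat criteria must both hold.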
