To the Editor:
With the recent release of the genome-wide sequence for multiple inbred mouse strains1, and with resequencing data for a large number of additional strains entering the public domain (http://www.niehs.nih.gov/crg/cprc.htm), we are one step closer to being able to identify the underlying genetic variants responsible for the trait characteristics that define each strain. Here, we describe a genome-wide catalog of coding variation in the mouse genome that was developed using an extensive collection of mouse DNA sequence reads, including those recently released by Celera, data from dbSNP2 and resequencing data generated by Perlegen Sciences for the US National Institute of Environmental Health Sciences (NIEHS). To display these data, we developed a new software tool, TranscriptSNPView, which has been integrated into the Ensembl Genome Browser to take advantage of the evolving mouse genome assembly and the latest Ensembl3 and Vega gene predictions4. TranscriptSNPView can be accessed via the Ensembl Genome Browser (http://www.ensembl.org/Mus_musculus/transcriptsnpview).
TranscriptSNPView displays coding SNP data from 48 mouse strains (Supplementary Table 1 online). Using the SNP calling algorithm ssahaSNP5, we computed over 50 million SNPs from the common laboratory Mus musculus strains A/J, DBA/2J, 129X1/SvJ and 129S1/SvImJ from whole-genome shotgun sequence reads generated by Celera, and from C3HeB/FeJ and NOD BAC-end sequence reads generated by the Wellcome Trust Sanger Institute. We also generated SNP calls from the Mus musculus molossinus strain MSM/Ms using sequence reads generated by RIKEN6 (Supplementary Table 1). Collectively, these SNP calls have been designated ‘Sanger SNPs’. The 25 million DNA sequence reads used to generate the Sanger SNP collection represent 7.32-fold coverage of the NCBI mouse build 35 genome assembly and are available via the Ensembl trace repository (http://trace.ensembl.org).
The Sanger SNP calls were distilled to 6.87 million nonredundant genome-wide SNP features and were combined with an additional 6.4 million dbSNP entries (version 126), providing data for an additional 41 mouse strains. By merging these data sets and mapping them against the Ensembl 38.35 mouse gene build, we collated 726,462 coding SNP variants across all strains and computed their amino acid consequences to identify 249,996 nonsynonymous coding changes and 2,667 stop codons. Coding SNP figures for each strain are provided in Supplementary Table 1. We also identified instances where stop codons had been lost, and we predicted mutations in introns, in invariant intronic splice sites and in untranslated and regulatory regions. These predictions, which can be used as a basis for identifying functional SNP variants, are displayed in TranscriptSNPView. A detailed description of all of the features of TranscriptSNPView is provided in the Supplementary Note online.
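As an illustration of what computing the amino acid consequence of a coding SNP involves, the following minimal Python sketch classifies a single-base change within one codon as synonymous, nonsynonymous, stop gained or stop lost. It is not the pipeline used for the catalog described here, nor Ensembl's implementation; the function name, its arguments and the category labels are hypothetical, and the sketch assumes the standard genetic code and that the reference codon and the position of the SNP within it are already known from the gene build.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, codon_offset, alt_base):
    """Classify a single-base substitution inside one codon.

    ref_codon    -- the three-base reference codon (hypothetical input)
    codon_offset -- 0, 1 or 2: position of the SNP within the codon
    alt_base     -- the variant base observed at that position
    """
    ref_codon = ref_codon.upper()
    alt_codon = (
        ref_codon[:codon_offset] + alt_base.upper() + ref_codon[codon_offset + 1:]
    )
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "nonsynonymous"


if __name__ == "__main__":
    print(classify_coding_snp("CTG", 2, "A"))  # Leu -> Leu
    print(classify_coding_snp("GAG", 1, "T"))  # Glu -> Val
    print(classify_coding_snp("TGG", 2, "A"))  # Trp -> stop
    print(classify_coding_snp("TAA", 0, "C"))  # stop -> Gln
```

A real annotation step would also need to handle strand orientation, codons that span exon boundaries and transcripts with alternative translations, none of which this sketch attempts.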
A data collection of this quality and depth is unprecedented and will provide the means to obtain a high-resolution picture of coding variation in the mouse genome. TranscriptSNPView represents a powerful new tool for functional analysis of the mouse genome and will become a central repository for mouse coding variation data.
[1] Elizabeth M. Smigielski, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 2001.
[2] J. Mullikin, et al. SSAHA: a fast search method for large DNA databases. Genome Research, 2001.
[3] Toshio Kojima, et al. Contribution of Asian mouse subspecies Mus musculus molossinus to genomic constitution of strain C57BL/6J, as defined by BAC-end sequence-SNP analysis. Genome Research, 2004.
[4] Free genome databases finally defeat Celera. Nature, 2005.