Improving the Sequence Ontology terminology for genomic variant annotation

BackgroundThe Genome Variant Format (GVF) uses the Sequence Ontology (SO) to enable detailed annotation of sequence variation. The annotation includes SO terms for the type of sequence alteration, the genomic features that are changed and the effect of the alteration. The SO maintains and updates the specification and provides the underlying ontologicial structure.MethodsA requirements analysis was undertaken to gather terms missing in the SO release at the time, but needed to adequately describe the effects of sequence alteration on a set of variant genomic annotations. We have extended and remodeled the SO to include and define all terms that describe the effect of variation upon reference genomic features in the Ensembl variation databases.ResultsThe new terminology was used to annotate the human reference genome with a set of variants from both COSMIC and dbSNP. A GVF file containing 170,853 sequence alterations was generated using the SO terminology to annotate the kinds of alteration, the effect of the alteration and the reference feature changed. There are four kinds of alteration and 24 kinds of effect seen in this dataset. (Ensembl Variation annotates 34 different SO consequence terms: http://www.ensembl.org/info/docs/variation/predicted_data.html).ConclusionsWe explain the updates to the Sequence Ontology to describe the effect of variation on existing reference features. We have provided a set of annotations using this terminology, and the well defined GVF specification. We have also provided a provisional exploration of this large annotation dataset.

[1]  Chao Chen,et al.  dbVar and DGVa: public archives for genomic structural variation , 2012, Nucleic Acids Res..

[2]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[3]  Daniel Rios,et al.  Bioinformatics Applications Note Databases and Ontologies Deriving the Consequences of Genomic Variants with the Ensembl Api and Snp Effect Predictor , 2022 .

[4]  P. Shannon,et al.  Exome sequencing identifies the cause of a Mendelian disorder , 2009, Nature Genetics.

[5]  Laurent Gil,et al.  Ensembl variation resources , 2010, BMC Genomics.

[6]  S A Forbes,et al.  The Catalogue of Somatic Mutations in Cancer (COSMIC) , 2008, Current protocols in human genetics.

[7]  Christian Duffin,et al.  The wellcome trust sanger institute. , 2015, Nursing standard (Royal College of Nursing (Great Britain) : 1987).

[8]  M. Vihinen Variation Ontology for annotation of variation effects and mechanisms , 2014, Genome research.

[9]  Mary Goldman,et al.  The UCSC Genome Browser database: extensions and updates 2011 , 2011, Nucleic Acids Res..

[10]  Daniel Rios,et al.  A database and API for variation, dense genotyping and resequencing data , 2010, BMC Bioinformatics.

[11]  Mary Goldman,et al.  The UCSC Genome Browser database: extensions and updates 2013 , 2012, Nucleic Acids Res..

[12]  R. Durbin,et al.  The Sequence Ontology: a tool for the unification of genome annotations , 2005, Genome Biology.

[13]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[14]  Ourania Horaitis,et al.  The challenge of documenting mutation across the genome: The human genome variation society approach , 2004, Human mutation.

[15]  Karen Eilbeck,et al.  A standard variation file format for human genome sequences , 2010, Genome Biology.

[16]  J. Shendure,et al.  Exome sequencing as a tool for Mendelian disease gene discovery , 2011, Nature Reviews Genetics.

[17]  Matthew R. Pocock,et al.  The Bioperl toolkit: Perl modules for the life sciences. , 2002, Genome research.

[18]  Sek Won Kong,et al.  gSearch: a fast and flexible general search tool for whole-genome sequencing , 2012, Bioinform..

[19]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[20]  H. Hakonarson,et al.  Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. , 2011, American journal of human genetics.

[21]  M. G. Reese,et al.  A probabilistic disease-gene finder for personal genomes. , 2011, Genome research.

[22]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[23]  Laurent Gil,et al.  Ensembl 2013 , 2012, Nucleic Acids Res..

[24]  Joseph K. Pickrell,et al.  A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes , 2012, Science.

[25]  Iscn International System for Human Cytogenetic Nomenclature , 1978 .