Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases

There is considerable interest in studying sequence variations. While the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions (indels) still poses problems. Each insertion or deletion changes the sequence. Yet, because of low-complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels that differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two indels are biologically equivalent. In such a case, the exact position of an indel cannot be identified from sequence alignment alone. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic insertion and deletion entries out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions with different functions, such as exons, introns, or 5' and 3' UTRs. One and the same variation can therefore be assigned to several different functional classifications, of which only one is correct. We implemented an algorithm that determines, for each indel database entry, its complete set of equivalent indels, which is uniquely characterized by the indel itself and a given interval of the reference sequence.
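The idea of a contiguous region of equivalent indels can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0-based coordinates, and the plain-string reference are assumptions made for illustration only. A deletion can be shifted by one base whenever the base leaving the deleted window equals the base entering it. An insertion can be shifted whenever the neighbouring reference base matches the end of the inserted sequence, which is then rotated by one base.

```python
def equivalent_deletions(ref, start, length):
    """Return (lo, hi), the inclusive range of 0-based start positions at which
    deleting `length` bases from `ref` yields the same alternative sequence as
    deleting ref[start:start + length]. Hypothetical helper for illustration."""
    lo = start
    # Shift left while the base entering the window equals the base leaving it.
    while lo > 0 and ref[lo - 1] == ref[lo + length - 1]:
        lo -= 1
    hi = start
    # Shift right under the mirrored condition.
    while hi + length < len(ref) and ref[hi] == ref[hi + length]:
        hi += 1
    return lo, hi


def equivalent_insertions(ref, pos, seq):
    """Return a sorted list of (position, inserted_sequence) pairs equivalent to
    inserting the non-empty `seq` before ref[pos]. Hypothetical helper."""
    results = [(pos, seq)]
    p, s = pos, seq
    # Shift left: the last inserted base must match the preceding reference base.
    while p > 0 and ref[p - 1] == s[-1]:
        s = s[-1] + s[:-1]
        p -= 1
        results.append((p, s))
    p, s = pos, seq
    # Shift right: the first inserted base must match the following reference base.
    while p < len(ref) and ref[p] == s[0]:
        s = s[1:] + s[0]
        p += 1
        results.append((p, s))
    return sorted(results)


def apply_deletion(ref, start, length):
    """Alternative sequence after deleting ref[start:start + length]."""
    return ref[:start] + ref[start + length:]


def apply_insertion(ref, pos, seq):
    """Alternative sequence after inserting `seq` before ref[pos]."""
    return ref[:pos] + seq + ref[pos:]


if __name__ == "__main__":
    ref = "GATCACACAGT"
    lo, hi = equivalent_deletions(ref, 3, 2)
    # Every start position in [lo, hi] produces the same alternative sequence.
    print(lo, hi, {apply_deletion(ref, p, 2) for p in range(lo, hi + 1)})
    print(equivalent_insertions("GATCACAGT", 3, "CA"))
```

In this toy example the two-base deletion inside the CA repeat can be annotated at any of five start positions, and an annotation tool that reports only one of them may place the variant in a different functional region than another tool does.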
