Frame: detection of genomic sequencing errors

MOTIVATION: The underlying error rate for genomic sequencing sometimes results in the introduction of artificial frameshifts and in-frame stop codons into putative protein-encoding genes. Severe errors are then introduced into the inferred transcripts through mis-translation or premature termination.

RESULTS: We describe a system for screening segments of DNA for frameshift and in-frame stop errors in coding regions. The method is based on homology matching, using blastx to compare all six reading frames of the query nucleotide sequence against selected protein sequence databases. Fragments of protein matching neighbouring regions of the query DNA are united and extended laterally to define candidate open reading frames (ORFs), within which frameshifts and stops are identified. Suitable targets include prokaryotic or other intron-free genomic sequence and complementary DNAs. As an example of its use, we report two frameshifted ORFs that deviate from the original TIGR sequence annotations for the recently released Helicobacter pylori genome.

AVAILABILITY: The tool is accessible via the URL http://www.sander.ebi.ac.uk/frame/.

CONTACT: brown@ebi.ac.uk.
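
The screening idea described above (neighbouring protein matches that switch reading frame, and stop codons inside a candidate ORF) can be illustrated with a minimal Python sketch. This is not the Frame implementation: the `Hit` record, the `max_gap` threshold, and both function names are hypothetical, the hits are assumed to have already been parsed from blastx output, and only the three forward frames are handled for brevity.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


@dataclass
class Hit:
    """Hypothetical record for one protein match on the query DNA."""
    subject: str  # identifier of the matched database protein
    frame: int    # forward reading frame: 1, 2, or 3
    start: int    # 0-based start on the query DNA
    end: int      # 0-based exclusive end on the query DNA


def find_frameshifts(hits, max_gap=30):
    """Flag neighbouring hits to the same protein that lie in different frames.

    max_gap is an assumed cutoff (in nucleotides) for calling two hits
    'neighbouring'; overlapping hits also qualify.
    """
    shifts = []
    last_hit = {}  # most recent hit seen for each subject protein
    for hit in sorted(hits, key=lambda h: h.start):
        prev = last_hit.get(hit.subject)
        if (prev is not None
                and hit.start - prev.end <= max_gap
                and hit.frame != prev.frame):
            shifts.append((hit.subject, prev.end, hit.start,
                           prev.frame, hit.frame))
        last_hit[hit.subject] = hit
    return shifts


def in_frame_stops(dna, start, end):
    """Return 0-based positions of stop codons read in frame from `start`."""
    dna = dna.upper()
    return [i for i in range(start, end - 2, 3) if dna[i:i + 3] in STOP_CODONS]


hits = [Hit("P1", 1, 0, 90), Hit("P1", 2, 94, 200)]
print(find_frameshifts(hits))                  # one candidate frameshift between 90 and 94
print(in_frame_stops("ATGAAATAGCCC", 0, 12))   # TAG at position 6
```

A fuller treatment would also cover the three reverse-strand frames and would extend the united matches outward to the flanking stop codons before scanning, as the abstract describes.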
