Effects of error-correction of heterozygous next-generation sequencing data

Background: Error correction is an important step in improving the quality of next-generation sequencing data for downstream analysis and use. Polymorphic datasets are a challenge for many bioinformatic software packages that are designed for, or assume, a homozygous input dataset. This assumption ignores the true genomic composition of the many organisms that are diploid or polyploid. In this survey, two error correction packages, Quake and ECHO, are examined to see how they perform on next-generation sequence data from heterozygous genomes.

Results: Quake and ECHO both performed well and corrected many of the errors in the data. However, errors at heterozygous positions showed distinct trends. Errors at these positions were sometimes corrected incorrectly, introducing new errors into the dataset and potentially creating chimeric reads. Quake was much less likely to create chimeric reads, but its read trimming removed a large portion of the original data and often left reads with few heterozygous markers. ECHO produced more chimeric reads and introduced more errors than Quake but preserved heterozygous markers. With real E. coli sequencing data, assembly statistics improved when reads were error-corrected before assembly. Segregating reads by haplotype was also found to improve the quality of an assembly in some cases.

Conclusions: These findings suggest that Quake and ECHO both have strengths and weaknesses when applied to heterozygous data. With the increased interest in haplotype-specific analysis, new haplotype-aware tools that avoid the weaknesses of Quake and ECHO are needed.
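
One way to see why heterozygous positions are difficult for correctors that assume a single genome is to look at k-mer counts. The sketch below is a toy illustration only, not Quake's or ECHO's actual algorithm: the two 32-base haplotypes, the read length, the value of k, and the fixed 75% coverage cutoff are all assumptions made for the example, and the reads are error-free and tiled circularly to avoid edge effects. It shows that k-mers spanning a single SNP occur at about half the depth of homozygous k-mers, so a naive coverage cutoff flags genuine alleles as if they were sequencing errors.

```python
from collections import Counter


def tile_circular_reads(haplotype, read_len):
    """Return one error-free read starting at every position, wrapping around the end."""
    wrapped = haplotype + haplotype[:read_len - 1]
    return [wrapped[i:i + read_len] for i in range(len(haplotype))]


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy parameters and sequences, chosen only for illustration.
K = 7
READ_LEN = 20
HAP_A = "ACGTTGCAAGGCTTACCGATTGACCTGAAGTC"
SNP_POS = 16  # HAP_A carries C here, HAP_B carries T
HAP_B = HAP_A[:SNP_POS] + "T" + HAP_A[SNP_POS + 1:]

reads_a = tile_circular_reads(HAP_A, READ_LEN)
reads_b = tile_circular_reads(HAP_B, READ_LEN)
counts = count_kmers(reads_a + reads_b, K)

# k-mers seen in only one haplotype span the SNP; the rest are shared.
kmers_a = set(count_kmers(reads_a, K))
kmers_b = set(count_kmers(reads_b, K))
het_kmers = kmers_a ^ kmers_b
shared_kmers = kmers_a & kmers_b

# Naive single-genome assumption: anything well below full depth is "untrusted".
full_depth = max(counts.values())
cutoff = 0.75 * full_depth
flagged = {kmer for kmer, n in counts.items() if n < cutoff}

print("shared k-mer depths:", sorted({counts[km] for km in shared_kmers}))
print("SNP-spanning k-mer depths:", sorted({counts[km] for km in het_kmers}))
print("flagged k-mers that are real alleles:", len(flagged & het_kmers))
```

In this toy setting every flagged k-mer is a true allele rather than an error. A corrector that then rewrites such k-mers toward the higher-coverage alternative could, across a read spanning several heterozygous sites, mix alleles from both haplotypes, which is the chimeric-read outcome described in the Results.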
