Novel read density distribution score shows possible aligner artefacts, when mapping a single chromosome

Background
Using artificial data to evaluate the performance of aligners and peak callers not only improves the accuracy and reliability of the evaluation, but also makes it possible to reduce the computational time. One natural way to achieve this reduction is to map a single chromosome.

Results
We investigated whether single-chromosome mapping causes any artefacts in aligner performance. We compared the accuracy of seven aligners on well-controlled simulated benchmark data sampled from a single chromosome and from a whole genome. We found that commonly used statistical methods are insufficient to evaluate aligner performance, and applied a novel measure of read density distribution similarity, which revealed artefacts in the aligners' performance. We also calculated mismatch statistics and constructed mismatch frequency distributions along the read.

Conclusions
Generating artificial data by mapping reads sampled from a single chromosome to a reference chromosome is justified as a way to reduce benchmarking time. The proposed quality assessment method identifies inherent shortcomings of aligners that conventional statistical methods do not detect and that can affect the alignment quality of real data.
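As an illustration of what a mismatch frequency distribution along the read can look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes ungapped alignments supplied as (read, reference segment) string pairs of equal length; the function name `mismatch_frequency_by_position` and the input format are hypothetical, chosen only for this example.

```python
from collections import Counter


def mismatch_frequency_by_position(pairs):
    """Return the per-position mismatch frequency along the read.

    `pairs` is an iterable of (read, reference_segment) strings.
    Assumption for this sketch: alignments are ungapped, so both
    strings in a pair have the same length and position i of the
    read is compared with position i of the reference segment.
    """
    mismatches = Counter()  # position -> number of mismatching reads
    coverage = Counter()    # position -> number of reads covering it

    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment differ in length")
        for i, (a, b) in enumerate(zip(read.upper(), ref.upper())):
            coverage[i] += 1
            if a != b:
                mismatches[i] += 1

    if not coverage:
        return []

    # Every read starts at position 0, so each position below the
    # maximum read length is covered by at least one read.
    n_positions = max(coverage) + 1
    return [mismatches[i] / coverage[i] for i in range(n_positions)]


if __name__ == "__main__":
    toy_pairs = [
        ("ACGT", "ACGT"),  # perfect match
        ("ACGA", "ACGT"),  # mismatch at the last position
        ("TCGA", "ACGT"),  # mismatches at the first and last positions
    ]
    print(mismatch_frequency_by_position(toy_pairs))
```

Real alignments with insertions, deletions, or soft clipping would need the read-to-reference coordinates resolved first (for example from the alignment record), which this sketch deliberately leaves out.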
