An Efficient Algorithm for Identifying Genomic Structural Inversion with Wide-spectrum of Length

Genomic structural inversion is a class of structural variation that has been widely associated with a range of complex traits and diseases. Accurately identifying inversions from high-throughput sequencing data is therefore of great significance for both research and clinical practice. However, detecting inversions is a challenging computational problem. Existing approaches are either limited to detecting inversions within specific length intervals or require a significant coverage distribution across the candidate interval. In this paper, we propose a novel detection algorithm that accurately identifies inversions across a wide spectrum of lengths. The proposed algorithm consists of two components: a clustering step and a segmentation and extension step. It first clusters the paired-end reads to narrow down the candidate intervals. Then, it uses a contig assembly strategy to reconstruct the candidate intervals. Meanwhile, a segmentation and extension strategy is implemented. For each candidate interval, a feature vector is calculated from its characteristic values. Finally, the algorithm combines the comparison verification results to filter out potential false positives, and then returns the inversion breakpoints at base-pair resolution. We conduct a series of simulation experiments to verify the performance of the proposed algorithm and compare it with two popular approaches, DELLY and Pindel. The results demonstrate that the proposed approach performs better on inversions across a wide spectrum of lengths, especially when short-to-medium-length inversions are present.
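As a rough illustration of the clustering step only, the following Python sketch groups read pairs into candidate intervals. The abstract does not state the clustering criteria, so everything here is an assumption: the sketch uses the commonly cited inversion signature in which both mates of a pair map to the same strand, and it merges pairs greedily when their positions lie within a distance threshold. The names `ReadPair`, `cluster_inversion_pairs`, `max_gap`, and `min_support` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReadPair:
    """One aligned read pair (hypothetical, simplified record)."""
    chrom: str
    pos1: int     # mapped position of the first mate
    strand1: str  # '+' or '-'
    pos2: int     # mapped position of the second mate
    strand2: str  # '+' or '-'


def cluster_inversion_pairs(
    pairs: List[ReadPair], max_gap: int = 500, min_support: int = 2
) -> List[Tuple[str, str, int, int, int]]:
    """Greedily cluster same-strand read pairs into candidate intervals.

    Assumption: a pair whose mates map to the same strand is treated as
    evidence for an inversion. Returns tuples of
    (chrom, orientation, start, end, number_of_supporting_pairs).
    """
    # Keep only same-strand pairs; normalise so that left <= right.
    signals = sorted(
        (p.chrom, p.strand1, min(p.pos1, p.pos2), max(p.pos1, p.pos2))
        for p in pairs
        if p.strand1 == p.strand2
    )

    clusters: List[list] = []
    current: list = []
    for sig in signals:
        if current:
            prev = current[-1]
            starts_new_cluster = (
                sig[0] != prev[0]                    # different chromosome
                or sig[1] != prev[1]                 # different orientation
                or sig[2] - prev[2] > max_gap        # left mates too far apart
                or abs(sig[3] - prev[3]) > max_gap   # right mates too far apart
            )
            if starts_new_cluster:
                clusters.append(current)
                current = []
        current.append(sig)
    if current:
        clusters.append(current)

    # Report only clusters with enough supporting pairs.
    return [
        (c[0][0], c[0][1], min(s[2] for s in c), max(s[3] for s in c), len(c))
        for c in clusters
        if len(c) >= min_support
    ]
```

In this sketch, each returned interval would then be handed to the later assembly, segmentation and extension steps, which are not modelled here. The threshold values are placeholders; in practice they would depend on the library's insert-size distribution.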