CHIC: a short read aligner for pan-genomic references

Computational pan-genomics has recently gained increasing attention, particularly the problem of moving from a single-reference paradigm to a pan-genomic one. Perhaps the simplest way to represent a pan-genome is as a set of sequences. While indexing highly repetitive collections has been intensively studied in the computer science community, that research has focused on efficient indexing and exact pattern matching, so most solutions are not yet suitable for use in bioinformatic analysis pipelines. Results: We present CHIC, a short-read aligner that indexes very large and repetitive references using a hybrid technique that combines Lempel-Ziv compression with Burrows-Wheeler read aligners. Availability: Our tool is open source and available online at https://gitlab.com/dvalenzu/CHIC
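
To give an intuition for why Lempel-Ziv compression suits a pan-genome stored as a set of sequences, the following minimal sketch computes a greedy LZ77-style factorization of a toy collection. It is an illustration only, not CHIC's implementation: the function names `lz77_factorize` and `lz77_decode`, the `$` separator, and the quadratic-time search are all assumptions made for brevity. The point it shows is that a second, nearly identical sequence is covered by only a few long phrases copied from the first.

```python
def lz77_factorize(text):
    """Greedy LZ77-style parse (hypothetical helper, naive quadratic search).

    Each phrase is either a single literal character (first occurrence)
    or a pair (source_position, length) copying an earlier substring.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start position and keep the longest match.
        for src in range(i):
            length = 0
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            phrases.append(text[i])            # unseen character: literal
            i += 1
        else:
            phrases.append((best_src, best_len))  # copy from earlier text
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_factorize."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            src, length = phrase
            # Copy one character at a time so overlapping copies work.
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


# Toy "pan-genome": two sequences differing in one base, joined by '$'.
collection = "ACGTTGCA" + "$" + "ACGTAGCA"
parse = lz77_factorize(collection)
print(len(collection), "characters ->", len(parse), "phrases")
print(parse)
```

In a hybrid index of the kind the abstract describes, only the text around phrase boundaries would need to be handed to a conventional Burrows-Wheeler aligner; how CHIC does this in detail is not covered by this sketch.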
