Fast parallel construction of variable-length Markov chains

Background: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first- and higher-order Markov chains, where a k-th order Markov chain over DNA has $$4^k$$ formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of fast, let alone parallel, software tools prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative.

Results: An extensive evaluation was performed on genomes ranging from 12 Mbp to 22 Gbp. Relevant learning parameters were chosen, guided by the Bayesian Information Criterion (BIC), to avoid over-fitting. Our implementation greatly improves upon the state of the art even in serial execution. It also exhibits very good parallel scaling: speed-ups for long sequences are close to the optimum indicated by Amdahl's law, which is 3 for 4 threads and about 6 for 16 threads.

Conclusions: Our parallel implementation, released as open source under the GPLv3 license, provides a practically useful alternative to the state of the art and allows the construction of VLMCs, even for very large genomes, significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end users comparing genomes.
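The two quantitative statements in the abstract can be checked with a short sketch. The helper names `amdahl_speedup` and `full_order_contexts` are hypothetical and not taken from the released software, and the parallel fraction p = 8/9 is an assumption inferred here: the abstract does not report it, but it is the value for which Amdahl's law gives exactly 3 for 4 threads, and it then also gives 6 for 16 threads.

```python
from fractions import Fraction


def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: upper bound on speed-up when only
    `parallel_fraction` of the run time benefits from more threads."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / threads)


def full_order_contexts(k, alphabet_size=4):
    """Number of contexts of a fixed k-th order Markov chain,
    i.e. the 4^k growth for DNA mentioned in the abstract."""
    return alphabet_size ** k


# Assumption: p = 8/9 is inferred from the quoted optimum of 3 at 4 threads;
# it is not a figure reported by the authors.
p = Fraction(8, 9)
for threads in (4, 16):
    print(threads, "threads -> bound", amdahl_speedup(p, threads))

# Exponential growth that VLMCs are meant to curtail.
for k in (2, 5, 10):
    print("order", k, "->", full_order_contexts(k), "contexts")
```

Under this assumption roughly one ninth of the run time is serial, which is why going from 4 to 16 threads only doubles the bound instead of quadrupling it.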
