CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

Motivation: Rapid advances in sequencing technologies have produced high volumes of sequence data. Sequence comparison is a primary step in archiving and analyzing those data. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice the k-mer frequency vectors for the large values of k that are of practical interest consume excessive memory and storage.

Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best-matching genomes in massive archived sequence repositories. With a 10² to 10⁴-fold reduction in storage space, CRAFT queries gigabytes of data within seconds or minutes, achieving performance comparable to that of six state-of-the-art alignment-free measures.

Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT.

Contact: wangying@xmu.edu.cn; fsun@usc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
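
To make the memory concern concrete, the following is a minimal Python sketch of a k-mer frequency computation and of how the size of a dense frequency vector grows with k. It is an illustration only, not CRAFT's implementation: the function names `kmer_frequencies` and `dense_vector_bytes` and the 4-byte entry size are assumptions chosen for this example.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Return normalized frequencies of the k-mers observed in seq.

    Hypothetical helper for illustration; windows containing
    characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def dense_vector_bytes(k, bytes_per_entry=4):
    """Size of a dense vector with one entry per possible DNA k-mer.

    There are 4**k possible k-mers; bytes_per_entry=4 assumes
    32-bit entries, which is an assumption of this sketch.
    """
    return (4 ** k) * bytes_per_entry


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT", 2)
    print(freqs)
    for k in (6, 10, 14):
        print(k, dense_vector_bytes(k), "bytes per sequence")
```

Because the number of possible k-mers is 4^k, a dense vector under these assumptions already needs 1 GiB per sequence at k = 14, which is the kind of cost a compact embedding is meant to avoid.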
