LAF: Logic Alignment Free and its application to bacterial genomes classification

Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequencies of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences. In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules). We apply LAF successfully to the classification of bacterial genomes into their corresponding taxa. In particular, we obtain reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequencies of just a few k-mers. State-of-the-art methods for adjusting k-mer frequencies to the character distribution of the underlying genomes have a negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
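The two ingredients named in the abstract, relative k-mer frequencies and disjunctive-normal-form rules, can be illustrated with a minimal Python sketch. It is not the LAF implementation: the function names, the rule encoding as lists of (k-mer, operator, threshold) literals, and the toy rule and thresholds are all assumptions made here for illustration, and the sketch only evaluates a given rule rather than learning one.

```python
from collections import Counter
from itertools import product


def kmer_relative_frequencies(sequence, k):
    """Map each of the 4**k DNA k-mers to its relative frequency in `sequence`."""
    sequence = sequence.upper()
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Count only windows made of unambiguous nucleotides (skips N and similar).
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def matches_dnf(freqs, rule):
    """Evaluate a DNF rule: an OR over clauses, each clause an AND over literals.

    Each literal is a hypothetical (kmer, op, threshold) triple with op
    either ">=" or "<", compared against the k-mer's relative frequency.
    """
    def literal_holds(kmer, op, threshold):
        value = freqs.get(kmer, 0.0)
        return value >= threshold if op == ">=" else value < threshold

    return any(all(literal_holds(*lit) for lit in clause) for clause in rule)


# Toy rule with made-up thresholds: (f(AC) >= 0.2 AND f(GG) < 0.1) OR (f(TT) >= 0.5)
toy_rule = [[("AC", ">=", 0.2), ("GG", "<", 0.1)], [("TT", ">=", 0.5)]]
freqs = kmer_relative_frequencies("ACGTACGTAC", 2)
print(matches_dnf(freqs, toy_rule))
```

In this encoding, a sample is assigned to a class when at least one clause of that class's rule is fully satisfied, which mirrors the if-then reading of a disjunctive-normal-form formula.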
