Theoretical and Practical Analyses in Metagenomic Sequence Classification

Metagenomics is the study of genomic sequences in a heterogeneous microbial sample taken, for example, from soil, water, the human microbiome, or skin. One of the primary objectives of metagenomic studies is to assign a taxonomic identity to each read sequenced from a sample and then to estimate the abundance of the known clades. With ever-larger metagenomic datasets from high-throughput sequencing technologies now readily available, several fast and accurate methods have been developed that run with reasonable computing requirements. Here we provide an overview of state-of-the-art methods for the classification of metagenomic sequences, highlighting in particular the theoretical factors that seem to correlate well with practical ones and that could therefore be useful when choosing an existing method or developing a new one in experimental contexts. In particular, we emphasize that the information derived from the known genomes and ultimately used in the learning and classification processes may create several experimental issues, mostly related to the amount of information used and to its uniqueness, significance, and redundancy, and that some of these issues are intrinsic to both current alignment-based approaches and compositional ones. This entails the need to develop efficient alignment-free methods that overcome such problems by combining the learning and classification processes in a single framework.
