A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads

Background: Taxonomic assignment is a crucial step in a metagenomic project; it aims to identify the origin of sequences in an environmental sample. Because composition-based algorithms are not sufficient for classifying short reads, recent algorithms rely on similarity alone or on similarity combined with other features. However, these algorithms are computationally expensive because similarity search is very time-consuming. In addition, short reads often carry too little similarity information against reference sequences, which significantly reduces classification quality.

Results: This paper presents a novel taxonomic assignment algorithm, called SeMeta, which is based on semi-supervised learning and produces a fast and highly accurate classification of short reads with sufficient mutual overlap. The proposed algorithm first separates reads into clusters using their composition features. It then labels the clusters, applying an efficient filtering technique to the results of a similarity search between their reads and reference databases. Furthermore, instead of performing the similarity search for all reads in the clusters, SeMeta does so only for reads in their subgroups, which it selects using sequence-overlap information. The experimental results demonstrate that SeMeta outperforms two other similarity-based algorithms in several respects.

Conclusions: By using a semi-supervised method and taking advantage of multiple features, the proposed algorithm not only achieves high classification quality but also substantially reduces computational cost. The source code of the algorithm can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMeta.html
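
The cluster-labeling idea described above can be illustrated with a minimal sketch. It is not the SeMeta implementation: the abstract does not specify the filtering technique, so a simple majority vote with a hypothetical `min_share` threshold stands in for it here. The names `label_clusters`, `representatives`, and `search` are likewise assumptions introduced for illustration, with `representatives` playing the role of the subgroup reads that are actually sent to the similarity search.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def label_clusters(
    clusters: Dict[int, List[str]],
    representatives: Dict[int, List[str]],
    search: Callable[[str], Optional[str]],
    min_share: float = 0.5,
) -> Dict[str, Optional[str]]:
    """Give every read the majority taxon found among its cluster's representatives.

    `search` is a hypothetical stand-in for a similarity search: it maps a
    read id to a taxon name, or to None when no hit is found.
    """
    assignment: Dict[str, Optional[str]] = {}
    for cluster_id, reads in clusters.items():
        # Query only the representative reads, not the whole cluster.
        hits = [search(r) for r in representatives.get(cluster_id, [])]
        votes = Counter(h for h in hits if h is not None)
        label = None
        if votes:
            taxon, count = votes.most_common(1)[0]
            # Assumed filter: keep the label only if enough hits agree on it.
            if count / sum(votes.values()) >= min_share:
                label = taxon
        # Propagate the cluster label (or None) to all reads in the cluster.
        for read in reads:
            assignment[read] = label
    return assignment


# Toy data: a lookup table stands in for a real similarity search.
toy_hits = {"r1": "Escherichia", "r2": "Escherichia", "r4": "Bacillus"}
clusters = {0: ["r1", "r2", "r3"], 1: ["r4", "r5"]}
representatives = {0: ["r1", "r2"], 1: ["r4"]}
result = label_clusters(clusters, representatives, toy_hits.get)
print(result)
```

In this toy run, reads `r3` and `r5` receive a label without ever being searched, which is where the saving in computational cost would come from under these assumptions.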
