Neural network-based taxonomic clustering for metagenomics

Metagenomic studies sample genetic material from an environment that may contain thousands of distinct microbial organisms. Sequencing this material produces many short fragments (<500 base pairs (bp)), each a small sample of the DNA of one organism. Any fragment may belong to any organism in the sample, and that relationship is unknown a priori. Furthermore, most of these organisms have not been identified and are therefore not represented in any publicly available search database. Our goal is to predict the taxonomic classification of an organism from the fragments obtained from an environmental sample that may include many organisms, some previously unidentified. To elucidate the diversity and composition of the sample, we first use a supervised naïve Bayes classifier to score the fragments of known genomes, and then apply unsupervised clustering to group fragments from similar organisms together. Each cluster can then be analyzed separately. The task is challenging because we are interested not in similar sequences but in sequences that come from similar genomes, and sequence composition is known to vary widely within a genome. Our dataset represents an extremely challenging scenario: clustering fragments at the phylum level, where none of the phyla have been previously seen or identified. We present two variations of our proposed approach, one based on adaptive resonance theory (ART) and one based on K-means. We show that ART can cluster 500 bp fragments from 17 novel phyla with an overall isolation/grouping that is 10% better than K-means and nearly 7 times better than chance.
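As a rough illustration of the unsupervised stage only, the sketch below clusters toy fragments with a plain K-means (Lloyd's algorithm). It is not the authors' pipeline: normalized dinucleotide frequencies stand in for the naïve Bayes scores described above, the ART variant is not shown, and the fragments, the function names (kmer_profile, squared_distance, kmeans), and the choice of initial centroids are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(fragment, k=2):
    """Return the normalized k-mer frequency vector of a DNA fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid dividing by zero
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy fragments (hypothetical): two AT-rich and two GC-rich sequences.
fragments = ["ATATATTATAATAT", "TTATATATAATTAA",
             "GCGCGGCCGCGGCG", "CCGCGCGGCGCCGG"]
profiles = [kmer_profile(f) for f in fragments]

# Deterministic initialization (an assumption made so the run is repeatable):
# seed the two centroids with the first and third profiles.
labels, _ = kmeans(profiles, [profiles[0], profiles[2]])
```

With these toy inputs the two AT-rich fragments share one label and the two GC-rich fragments share the other. Real fragments from unseen phyla are far less separable, which is the difficulty the abstract describes.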
