An ecoinformatics tool for microbial community studies: Supervised classification of Amplicon Length Heterogeneity (ALH) profiles of 16S rRNA

Support vector machines (SVM) and K-nearest neighbors (KNN) are two machine learning methods that perform supervised classification. This paper presents a novel application of these supervised methods to microbial community profiling and to distinguishing patterns among ecosystems. Amplicon length heterogeneity (ALH) profiles from several hypervariable regions of the 16S rRNA gene of eubacterial communities from Idaho agricultural soil samples and from Chesapeake Bay marsh sediments were analyzed separately. The profiles from all available hypervariable regions were concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. Each profile was labeled with the location or time of its sampling. We hypothesized that, after a learning phase using feature vectors from labeled ALH profiles, both classifiers would be able to predict the labels of previously unseen samples. The resulting classifiers predicted the labels of the Idaho soil samples with high accuracy. They were less accurate on the Chesapeake Bay sediments, suggesting greater similarity among the microbial community patterns at the sampled Bay sites. The profiles obtained from the V1+V2 region were more informative than those obtained from any other single region. However, combining them with profiles from the V1 region (with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition
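
To make the workflow described in the abstract concrete, the following is a minimal sketch of concatenating per-region ALH profiles into one feature vector and classifying an unseen sample with KNN. It is not the authors' implementation. The assumptions are that each profile is a list of relative peak abundances, that Euclidean distance with k = 3 is used, and that the site labels and all numeric values are invented for illustration. The function names `concatenate_profiles` and `knn_predict` are hypothetical. An SVM would normally come from a library and is omitted here.

```python
from collections import Counter
from math import dist


def concatenate_profiles(region_profiles, regions):
    """Join per-region ALH profiles into one feature vector, in a fixed region order."""
    vector = []
    for region in regions:
        vector.extend(region_profiles[region])
    return vector


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Assumed region order and made-up relative abundances (two peaks per region).
REGIONS = ["V1", "V1+V2", "V3"]
labeled_samples = [
    ({"V1": [0.7, 0.3], "V1+V2": [0.6, 0.4], "V3": [0.5, 0.5]}, "site_A"),
    ({"V1": [0.8, 0.2], "V1+V2": [0.7, 0.3], "V3": [0.5, 0.5]}, "site_A"),
    ({"V1": [0.2, 0.8], "V1+V2": [0.3, 0.7], "V3": [0.4, 0.6]}, "site_B"),
    ({"V1": [0.1, 0.9], "V1+V2": [0.2, 0.8], "V3": [0.5, 0.5]}, "site_B"),
]

# Learning phase for KNN: store the labeled feature vectors.
train_vectors = [concatenate_profiles(p, REGIONS) for p, _ in labeled_samples]
train_labels = [label for _, label in labeled_samples]

# A previously unseen sample whose label we want to predict.
unseen = {"V1": [0.75, 0.25], "V1+V2": [0.65, 0.35], "V3": [0.5, 0.5]}
unseen_vector = concatenate_profiles(unseen, REGIONS)
predicted = knn_predict(train_vectors, train_labels, unseen_vector, k=3)
print(predicted)
```

Changing the `REGIONS` list (for example, to `["V1", "V1+V2"]`) is one way to compare how different combinations of hypervariable regions affect the prediction.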
