Some Clustering and Classification Problems in High-Throughput Metagenomics and Cheminformatics

In this dissertation, we address three problems in high-throughput metagenomics and cheminformatics.

(1) Metagenomics studies the genomic content of an entire microbial community by simultaneously sequencing all genomes in an environmental sample. The advent of next-generation sequencing (NGS) technologies has drastically reduced sequencing time and cost, leading to the generation of millions of sequences (reads) in a single run. An important problem in metagenomic analysis is to determine and quantify the species (or genomes) in a metagenomic sample. The problem is challenging because the number of genomes and their abundance ratios are unknown, because of repeats and sequencing errors, and because NGS reads are short. We propose two algorithms to address these challenges. First, we present an algorithm for separating short paired-end reads from genomes with similar abundance levels. Second, we propose a method to accurately estimate the abundance levels of species. This method automatically determines the number of abundance groups in a metagenomic dataset and bins the reads into these groups.

(2) NGS coupled with metagenomics has led to the rapid growth of sequence databases and enabled a new branch of microbiology called comparative metagenomics. It is a fast-growing field that requires the development of novel supervised learning techniques. In particular, microbial community classification may have useful applications: efficient organization and search in rapidly growing metagenomic databases, detection of disease phenotypes in clinical samples, and forensic identification. We propose a novel supervised classification method for metagenomic samples that takes advantage of the natural structure in microbial community data encoded by a phylogenetic tree.

(3) In modern drug discovery, ultra-high-throughput screening is applied to millions of drug-like compounds in one experiment. Hierarchical clustering is an important step in the drug discovery process. Standard implementations of the exact algorithm for hierarchical clustering require O(n^2) time and O(n^2) memory, which is prohibitive for datasets of this size. Even though approximate hierarchical clustering methods overcome this problem, they either rely on embedding into spaces that are not biologically sensible or produce very low-resolution hierarchical structures. We present a hybrid hierarchical clustering algorithm requiring approximately O(n sqrt(n)) time and O(n sqrt(n)) memory while still preserving the most desirable properties of the exact algorithm.
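As a rough illustration of where a cost near n sqrt(n) can come from, the Python sketch below counts distance evaluations for a simple two-level scheme: pick about sqrt(n) pivots, assign every point to its nearest pivot, then cluster exactly within each group and among the pivots. This is not the hybrid algorithm presented in the dissertation. Random pivot selection, the Euclidean distance, and all function names are assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pivot_partition(points, seed=0):
    """Assign each point to the nearest of about sqrt(n) randomly chosen pivots.

    Returns (groups, evaluations), where groups maps a pivot index to the
    list of point indices assigned to it, and evaluations is the number of
    distance computations performed (n * k).
    """
    rng = random.Random(seed)
    n = len(points)
    k = max(1, math.isqrt(n))
    pivots = rng.sample(range(n), k)  # k distinct point indices
    groups = {p: [] for p in pivots}
    evaluations = 0
    for i, pt in enumerate(points):
        nearest = min(pivots, key=lambda p: euclidean(pt, points[p]))
        evaluations += k
        groups[nearest].append(i)
    return groups, evaluations


def pairwise_count(m):
    """Number of entries in the distance matrix an exact method needs for m items."""
    return m * (m - 1) // 2


def hybrid_cost(points, seed=0):
    """Distance evaluations for the two-level scheme sketched here.

    Counts the assignment step, the exact clustering inside each group,
    and the exact clustering of the pivots themselves.
    """
    groups, assign_evals = pivot_partition(points, seed)
    within = sum(pairwise_count(len(g)) for g in groups.values())
    between = pairwise_count(len(groups))
    return assign_evals + within + between


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.random(), rng.random()) for _ in range(2500)]
    print("exact :", pairwise_count(len(data)))
    print("hybrid:", hybrid_cost(data))
```

When the groups are roughly balanced, each holds about sqrt(n) points, so the within-group work is about n per group and about n sqrt(n) in total, matching the assignment step. Badly unbalanced groups push the cost back toward n^2, which is one reason a practical method needs a more careful partitioning step than random pivots.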

[1]  Huzefa Rangwala,et al.  Evaluation of short read metagenomic assembly , 2011, BMC Genomics.

[2]  Phat L Tran,et al.  Metabolic Complementarity and Genomics of the Dual Bacterial Symbiosis of Sharpshooters , 2006, PLoS biology.

[3]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[4]  J. Hemmer-Hansen,et al.  Application of SNPs for population genetics of nonmodel organisms: new opportunities and challenges , 2011, Molecular ecology resources.

[5]  Rob Knight,et al.  PyNAST: a flexible tool for aligning sequences to a template alignment , 2009, Bioinform..

[6]  P. Zhao,et al.  The composite absolute penalties family for grouped and hierarchical variable selection , 2009, 0909.0411.

[7]  Jean-Philippe Vert,et al.  Group lasso with overlap and graph lasso , 2009, ICML '09.

[8]  Mark J. P. Chaisson,et al.  Short read fragment assembly of bacterial genomes. , 2008, Genome research.

[9]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[10]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[11]  Siu-Ming Yiu,et al.  MetaCluster 4.0: A Novel Binning Algorithm for NGS Reads and Huge Number of Species , 2012, J. Comput. Biol..

[12]  P. Bork,et al.  Enterotypes of the human gut microbiome , 2011, Nature.

[13]  Dmitriy Fradkin,et al.  Bayesian Multinomial Logistic Regression for Author Identification , 2005, AIP Conference Proceedings.

[14]  Gunnar Rätsch,et al.  Support Vector Machines and Kernels for Computational Biology , 2008, PLoS Comput. Biol..

[15]  Aidong Zhang,et al.  Cluster analysis for gene expression data: a survey , 2004, IEEE Transactions on Knowledge and Data Engineering.

[16]  M. Waterman,et al.  Estimating the repeat structure and length of DNA sequences using L-tuples. , 2003, Genome research.

[17]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[18]  R. Knight,et al.  Bacterial Community Variation in Human Body Habitats Across Space and Time , 2009, Science.

[19]  S. Dongen Graph clustering by flow simulation , 2000 .

[20]  Satoru Miyano,et al.  Open source clustering software , 2004 .

[21]  S. Salzberg,et al.  Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models , 2009, Nature Methods.

[22]  Jacques van Helden,et al.  Evaluation of clustering algorithms for protein-protein interaction networks , 2006, BMC Bioinformatics.

[23]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[24]  Tong Zhang,et al.  Text Categorization Based on Regularized Linear Classification Methods , 2001, Information Retrieval.

[25]  J. Parkhill,et al.  Comparative genomic structure of prokaryotes. , 2004, Annual review of genetics.

[26]  Silke Wagner,et al.  Comparing Clusterings - An Overview , 2007 .

[27]  N. Perrimon,et al.  Genome-Wide RNAi Analysis of Growth and Viability in Drosophila Cells , 2004, Science.

[28]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[29]  D. Alland,et al.  A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. , 2007, Journal of microbiological methods.

[30]  Mihai Pop,et al.  Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples , 2009, PLoS Comput. Biol..

[31]  H. Edelsbrunner,et al.  Efficient algorithms for agglomerative hierarchical clustering methods , 1984 .

[32]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[33]  Monzoorul Haque Mohammed,et al.  DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences , 2010, BMC Bioinformatics.

[34]  J. Handelsman,et al.  Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. , 1998, Chemistry & biology.

[35]  Jonathan A. Eisen,et al.  The Phylogenetic Diversity of Metagenomes , 2011, PloS one.

[36]  R. Knight,et al.  Global patterns in bacterial diversity , 2007, Proceedings of the National Academy of Sciences.

[37]  G. Narasimhan,et al.  An eco-informatics tool for microbial community studies: supervised classification of Amplicon Length Heterogeneity (ALH) profiles of 16S rRNA. , 2006, Journal of microbiological methods.

[38]  Hans-Peter Kriegel,et al.  Data bubbles: quality preserving performance boosting for hierarchical clustering , 2001, SIGMOD '01.

[39]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[40]  Aibin Zhan,et al.  High sensitivity of 454 pyrosequencing for detection of rare species in aquatic communities , 2013 .

[41]  William A. Walters,et al.  QIIME allows analysis of high-throughput community sequencing data , 2010, Nature Methods.

[42]  I. Rigoutsos,et al.  Accurate phylogenetic classification of variable-length DNA fragments , 2007, Nature Methods.

[43]  Zhaojun Bai,et al.  CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads , 2007, RECOMB.

[44]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[45]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[46]  Anne-Laure Boulesteix,et al.  Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics , 2012, WIREs Data Mining Knowl. Discov..

[47]  Monzoorul Haque Mohammed,et al.  SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences , 2009, Bioinform..

[48]  Saman K. Halgamuge,et al.  BMC Bioinformatics BioMed Central Methodology article Binning sequences using very sparse labels within a metagenome , 2008 .

[49]  D. Bentley,et al.  Whole-genome re-sequencing. , 2006, Current opinion in genetics & development.

[50]  B. Roe,et al.  A core gut microbiome in obese and lean twins , 2008, Nature.

[51]  Martin Hartmann,et al.  Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities , 2009, Applied and Environmental Microbiology.

[52]  M. David,et al.  Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw , 2011, Nature.

[53]  Ronald Rousseau,et al.  Similarity measures in scientometric research: The Jaccard index versus Salton's cosine formula , 1989, Inf. Process. Manag..

[54]  Naryttza N. Diaz,et al.  TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach , 2009, BMC Bioinformatics.

[55]  Juliane C. Dohm,et al.  SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. , 2007, Genome research.

[56]  Naryttza N. Diaz,et al.  Phylogenetic classification of short environmental DNA fragments , 2008, Nucleic acids research.

[57]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[58]  R. Knight,et al.  UniFrac: a New Phylogenetic Method for Comparing Microbial Communities , 2005, Applied and Environmental Microbiology.

[59]  S. Tringe,et al.  Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen , 2011, Science.

[60]  E. Mardis,et al.  An obesity-associated gut microbiome with increased capacity for energy harvest , 2006, Nature.

[61]  E. Koonin,et al.  Construction and analysis of bacterial artificial chromosome libraries from a marine microbial assemblage. , 2000, Environmental microbiology.

[62]  Kang Ning,et al.  Saliva microbiomes distinguish caries-active from healthy human populations , 2011, The ISME Journal.

[63]  M. Pop,et al.  Metagenomic Analysis of the Human Distal Gut Microbiome , 2006, Science.

[64]  Adam A. Margolin,et al.  The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity , 2012, Nature.

[65]  R. Hertzberg,et al.  High-throughput screening: new technology for the 21st century. , 2000, Current opinion in chemical biology.

[66]  E. Lander,et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis. , 1988, Genomics.

[67]  Ying Xu,et al.  Barcodes for genomes and applications , 2008, BMC Bioinformatics.

[68]  Meelis Kull,et al.  Fast approximate hierarchical clustering using similarity heuristics , 2008, BioData Mining.

[69]  R. Knight,et al.  Supervised classification of human microbiota. , 2011, FEMS microbiology reviews.

[70]  Bin Zhou,et al.  Chemical and Biological Properties of Frequent Screening Hits , 2012, J. Chem. Inf. Model..

[71]  S. Tringe,et al.  Comparative Metagenomics of Microbial Communities , 2004, Science.

[72]  K. Schleifer,et al.  Phylogenetic identification and in situ detection of individual microbial cells without cultivation. , 1995, Microbiological reviews.

[73]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[74]  J. Handelsman,et al.  Introducing TreeClimber, a Test To Compare Microbial Community Structures , 2006, Applied and Environmental Microbiology.

[75]  René L. Warren,et al.  Assembling millions of short DNA sequences using SSAKE , 2006, Bioinform..

[76]  Christos Levcopoulos,et al.  The First Subquadratic Algorithm for Complete Linkage Clustering , 1995, ISAAC.

[77]  Siu-Ming Yiu,et al.  A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio , 2011, Bioinform..

[78]  Musa H. Asyali,et al.  Gene Expression Profile Classification: A Review , 2006 .

[79]  J. Banfield,et al.  Community structure and metabolism through reconstruction of microbial genomes from the environment , 2004, Nature.

[80]  Philip Bille,et al.  A survey on tree edit distance and related problems , 2005, Theor. Comput. Sci..

[81]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[82]  Ernest Szeto,et al.  Symbiosis insights through metagenomic analysis of a microbial consortium. , 2006, Nature Reviews Microbiology.

[83]  Christos Levcopoulos,et al.  Optimal Algorithms for Complete Linkage Clustering in d Dimensions , 2002, MFCS.

[84]  Zhenqiu Liu,et al.  Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data , 2011, Bioinform..

[85]  Daniel H. Huson,et al.  MetaSim—A Sequencing Simulator for Genomics and Metagenomics , 2008, PloS one.

[86]  Paramvir S. Dehal,et al.  FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments , 2010, PloS one.

[87]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[88]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[89]  Jesse R. Zaneveld,et al.  Human-associated microbial signatures: examining their predictive value. , 2011, Cell host & microbe.

[90]  R. Knight,et al.  Species divergence and the measurement of microbial diversity. , 2008, FEMS microbiology reviews.

[91]  Daniel Müllner,et al.  fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python , 2013 .

[92]  W. J. Jones High-Throughput Sequencing and Metagenomics , 2010 .

[93]  Man-Ling Lee,et al.  DISE: Directed Sphere Exclusion , 2003, J. Chem. Inf. Comput. Sci..

[94]  Fengzhu Sun,et al.  Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny , 2011, BMC Bioinformatics.

[95]  O. White,et al.  Environmental Genome Shotgun Sequencing of the Sargasso Sea , 2004, Science.

[96]  John M. Barnard,et al.  Clustering Methods and Their Uses in Computational Chemistry , 2003 .

[97]  Fionn Murtagh,et al.  Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding , 2008, SIAM J. Sci. Comput..

[98]  Ray A. Jarvis,et al.  Clustering Using a Similarity Measure Based on Shared Near Neighbors , 1973, IEEE Transactions on Computers.

[99]  Fionn Murtagh,et al.  Fast, linear time, m-adic hierarchical clustering for search and retrieval using the Baire metric, with linkages to generalized ultrametrics, hashing, formal concept analysis, and precision of data measurement , 2011, 1111.6254.

[100]  Raj Acharya,et al.  A two-way multi-dimensional mixture model for clustering metagenomic sequences , 2011, BCB '11.

[101]  Siu-Ming Yiu,et al.  MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample , 2012, Bioinform..

[102]  Michael C Wendl,et al.  Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. , 2002, Genome research.

[103]  Siu-Ming Yiu,et al.  Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers , 2009, BMC Bioinformatics.

[104]  Eric P. Xing,et al.  Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity , 2009, ICML.

[105]  Peer Bork,et al.  Discovering Functional Novelty in Metagenomes: Examples from Light-Mediated Processes , 2008, Journal of bacteriology.

[106]  Robin Sibson,et al.  SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method , 1973, Comput. J..

[107]  Haixu Tang,et al.  Comparing Bacterial Communities Inferred from 16s Rrna Gene Sequencing and Shotgun Metagenomics , 2011, Pacific Symposium on Biocomputing.

[108]  P. Bühlmann,et al.  The group lasso for logistic regression , 2008 .

[109]  Michael R. Thon,et al.  Supervised Protein Family Classification and New Family Construction , 2012, J. Comput. Biol..

[110]  Abhishek Sarkar,et al.  Split-Order Distance for Clustering and Classification Hierarchies , 2009, SSDBM.

[111]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[112]  Trevor J. Hastie,et al.  Genome-wide association analysis by lasso penalized logistic regression , 2009, Bioinform..

[113]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[114]  James Bailey,et al.  Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res..

[115]  S. Giovannoni,et al.  The uncultured microbial majority. , 2003, Annual review of microbiology.