论文信息 - Some Clustering and Classification Problems in High-Throughput Metagenomics and Cheminformatics

Some Clustering and Classification Problems in High-Throughput Metagenomics and Cheminformatics

In this dissertation, we address three different problems in high-throughput metagenomics and cheminformatics.(1) Metagenomics studies the genomic content of an entire microbial community by simultaneously sequencing all genomes in an environmental sample. The advent of next-generation sequencing (NGS) technologies has drastically reduced sequencing time and cost, leading to the generation of millions of sequences (reads) in a single run. An important problem in metagenomic analysis is to determine and quantify species (or genomes) in a metagenomic sample. The problem is challenging due to an unknown number of genomes and their abundance ratios, presence of repeats and sequencing errors, and the short length of NGS reads. We propose two algorithms to address these challenges. First, we present an algorithm for separating short paired-end reads from genomes with similar abundance levels. Second, we propose a method to accurately estimate the abundance levels of species. The algorithm automatically determines the number of abundance groups in a metagenomic dataset and bins the reads into these groups.(2) NGS coupled with metagenomics has led to the rapid growth of sequence databases and enabled a new branch of microbiology called comparative metagenomics. It is a fast growing field that requires the development of novel supervised learning techniques. In particular, the problem of microbial community classification may have useful applications enabling efficient organization and search in rapidly growing metagenomic databases, detection of disease phenotypes in clinical samples, and forensic identification. We propose a novel supervised classification method for metagenomic samples that takes advantage of the natural structure in microbial community data encoded by a phylogenetic tree.(3) In modern drug discovery, ultra-high-throughput screening is applied to millions of drug-like compounds in one experiment. Hierarchical clustering is an important step in the drug discovery process. Standard implementations of the exact algorithm for hierarchical clustering require O(n 2 ) time and O(n 2 ) memory. Even though approximate hierarchical clustering methods overcome this problem, they either rely on embedding into spaces that are not biologically sensible, or produce very low resolution hierarchical structures. We present a hybrid hierarchical clustering algorithm requiring approximately O(n sqrt(n)) time and O(n sqrt(n)) memory while still preserving the most desirable properties of the exact algorithm.

Olga Tanaseichuk | Olga Tanaseichuk

[1] Huzefa Rangwala,et al. Evaluation of short read metagenomic assembly , 2011, BMC Genomics.

[2] Phat L Tran,et al. Metabolic Complementarity and Genomics of the Dual Bacterial Symbiosis of Sharpshooters , 2006, PLoS biology.

[3] Anil K. Jain,et al. Data clustering: a review , 1999, CSUR.

[4] J. Hemmer-Hansen,et al. Application of SNPs for population genetics of nonmodel organisms: new opportunities and challenges , 2011, Molecular ecology resources.

[5] Rob Knight,et al. PyNAST: a flexible tool for aligning sequences to a template alignment , 2009, Bioinform..

[6] P. Zhao,et al. The composite absolute penalties family for grouped and hierarchical variable selection , 2009, 0909.0411.

[7] Jean-Philippe Vert,et al. Group lasso with overlap and graph lasso , 2009, ICML '09.

[8] Mark J. P. Chaisson,et al. Short read fragment assembly of bacterial genomes. , 2008, Genome research.

[9] D. Botstein,et al. Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[10] William M. Rand,et al. Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[11] Siu-Ming Yiu,et al. MetaCluster 4.0: A Novel Binning Algorithm for NGS Reads and Huge Number of Species , 2012, J. Comput. Biol..