Estimating the number of species in metagenomes by clustering next-generation read sequences

Fast and inexpensive next-generation sequencing (NGS) technologies that can sequence uncultured microbes present us with unprecedented opportunities to distill meaningful information from millions of short metagenomic read sequences. In contrast to reads from a single-species genome, NGS read sequences from metagenomes are extremely complex and heterogeneous, because a metagenome is a collection of genetic material from a very large number of microbes with varying abundance levels. In this paper, we present a method to estimate the number of species in metagenomic sequences through efficient clustering of metagenomic NGS read sequences. We believe that our method will contribute to a better understanding of the microbial communities represented in metagenomes.