Identifying and removing artificial replicates from 454 pyrosequencing data.
An intrinsic artifact of 454-based pyrosequencing leads to artificial overrepresentation of >10% of the original DNA sequencing templates. This artificial amplification of sequences is unbiased with regard to position on the pyrosequencing plate or sequence identity, and it occurs in all currently available 454 technologies. The amplified sequences start at the same position and are identical (duplicates), or vary in length, or contain a sequencing discrepancy. If the abundance of any sequence in a data set is going to be enumerated, either for comparative community analysis, transcriptional analysis or other applications, it is important to remove these artificial replicates before analysis. A web-based tool that incorporates the clustering algorithm cd-hit was developed to identify and remove artificially replicated sequences in 454-based pyrosequencing data sets. This tool cannot be used for data sets that have an initial amplification step before the standard pyrosequencing procedure, because artificial replicates cannot be distinguished from expected replication due to polymerase chain reaction (PCR) amplification, e.g., in sequencing of amplified gene "tags." This protocol provides details on how to use the replicate filter and obtain a file of unique sequences for use in metagenomic or transcriptomic analyses.
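The replicate definition above (reads that start at the same position and are identical, differ only in length, or contain a sequencing discrepancy) can be illustrated with a small sketch. This is not the web tool and it does not call cd-hit. It is a simplified greedy, longest-first grouping written only to show the idea. The function names, the `prefix_len` and `max_mismatch` parameters, and their default values are assumptions made for this example, and only substitutions are counted (no insertions or deletions).

```python
from collections import defaultdict


def mismatch_fraction(a, b):
    """Fraction of mismatched positions over the shorter sequence's length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # zip stops at the shorter sequence, so a pure length difference costs nothing
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n


def filter_replicates(reads, prefix_len=3, max_mismatch=0.1):
    """Group reads that share a start and agree within max_mismatch.

    reads: dict mapping read name -> sequence.
    prefix_len, max_mismatch: hypothetical thresholds chosen for illustration.
    Returns (unique, replicate_of): representative reads, and a mapping from
    each removed read to the representative it was grouped with.
    """
    # Process longest reads first so shorter replicates attach to a longer one.
    order = sorted(reads, key=lambda name: (-len(reads[name]), name))
    reps_by_prefix = defaultdict(list)
    unique = {}
    replicate_of = {}
    for name in order:
        seq = reads[name]
        key = seq[:prefix_len]  # reads must begin with the same bases
        for rep in reps_by_prefix[key]:
            if mismatch_fraction(seq, reads[rep]) <= max_mismatch:
                replicate_of[name] = rep
                break
        else:
            # No existing representative matched: keep this read.
            reps_by_prefix[key].append(name)
            unique[name] = seq
    return unique, replicate_of


# Made-up reads for illustration only.
reads = {
    "r1": "ACGTACGTACGTACGTACGT",
    "r2": "ACGTACGTACGTACGTACGT",  # exact duplicate of r1
    "r3": "ACGTACGTACGTACG",       # same start, shorter
    "r4": "ACGTACGTACTTACGTACGT",  # same start, one discrepancy
    "r5": "TTGCACGTACGTACGTACGT",  # different start
}
unique, replicate_of = filter_replicates(reads)
print(sorted(unique))
print(replicate_of)
```

As the abstract notes, a filter of this kind is only meaningful for data sets without an initial amplification step, because PCR-derived copies would be grouped and removed in exactly the same way.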
[1] Tracy K. Teal, et al. Systematic artifacts in metagenomes from complex microbial communities, 2009, The ISME Journal.
[2] Adam Godzik, et al. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, 2006, Bioinformatics.
[3] Susan M. Huse, et al. Microbial diversity in the deep sea and the underexplored "rare biosphere", 2006, Proceedings of the National Academy of Sciences.