Sequence clustering in bioinformatics: an empirical study.

Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have lowered costs, and as a result massive amounts of DNA/RNA sequence data are being produced. The challenge is to cluster these data using stable, fast and accurate methods. For microbiome sequencing data, 16S ribosomal RNA sequences are typically clustered into operational taxonomic units (OTUs). However, there is often a gap between algorithm developers and bioinformatics users. Different software tools can produce different results, which users may find difficult to interpret. Understanding the different clustering mechanisms is crucial to understanding the results they produce. In this review, we selected several popular clustering tools, briefly explained their key computing principles, analyzed their characteristics and compared them using two independent benchmark datasets. Our aim is to help bioinformatics users choose and apply suitable clustering tools effectively when analyzing large-scale sequencing data. Related data, code and software tools are available at http://lab.malab.cn/~lg/clustering/.
