A novel hierarchical clustering algorithm for gene sequences

Background: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. The method transforms each DNA sequence into a feature vector that records the occurrence, location and order relation of the k-tuples in the sequence. A hierarchical procedure is then applied to cluster the DNA sequences based on these feature vectors.

Results: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. The method is also compared with BlastClust, CD-HIT-EST and several other methods. The experimental results show that our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationships among the sequences.

Conclusions: We introduce a novel clustering algorithm based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationships among the sequences.
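To make the idea of an alignment-free k-tuple feature vector concrete, the following is a minimal sketch in Python. It is not the DMk measure or the mBKM algorithm, whose definitions are not given in this abstract. The function names `ktuple_features` and `euclidean`, the choice of features (relative frequency and mean normalized position of each k-tuple) and the use of Euclidean distance are all assumptions made for illustration. In particular, the sketch covers only occurrence and location, not the order relation of k-tuples.

```python
from itertools import product
from math import sqrt


def ktuple_features(seq, k=2):
    """Illustrative feature vector (hypothetical, not the paper's DMk).

    For every k-tuple over the alphabet ACGT, emit two numbers:
    its relative frequency and its mean start position scaled by the
    number of k-tuple windows in the sequence.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {w: 0 for w in words}
    pos_sum = {w: 0 for w in words}
    n = len(seq) - k + 1  # number of k-tuple windows in the sequence

    for i in range(n):
        w = seq[i:i + k]
        if w in counts:  # skip windows containing non-ACGT symbols
            counts[w] += 1
            pos_sum[w] += i

    features = []
    for w in words:
        c = counts[w]
        freq = c / n if n > 0 else 0.0
        mean_pos = (pos_sum[w] / c) / n if c else 0.0
        features.extend([freq, mean_pos])
    return features


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = ktuple_features("ACGTACGTACGT", k=2)
    b = ktuple_features("ACGTACGAACGT", k=2)
    c = ktuple_features("TTTTGGGGCCCC", k=2)
    print(euclidean(a, b), euclidean(a, c))
```

A pairwise distance matrix built this way could then be passed to any hierarchical clustering procedure. The abstract does not describe mBKM in enough detail to reproduce its procedure here.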
