Performance Comparison of Clustering Methods for Gene Family Data

Abstract. Clustering gene sequences into families is important for understanding and predicting gene function. Many clustering algorithms and alignment-free similarity measures have been used to analyze gene families, and the clustering results depend on both the similarity measure and the clustering algorithm chosen. We compare four commonly used clustering methods (K-means, single-linkage, complete-linkage, and average-linkage clustering), each applied with three alignment-free similarity measures, to determine which combination gives the best clustering results on real-world gene family datasets. Experimental results show that average-linkage clustering with our similarity measure, DMk, performed best.

Keywords: Gene family,
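To make the kind of comparison described above concrete, the following is a minimal sketch of agglomerative clustering under the single-, complete-, and average-linkage rules, applied to alignment-free feature vectors. It is an illustration only, not the procedure used in this work: the functions `kmer_profile` and `agglomerative` are hypothetical names, the toy sequences are invented, and the Euclidean distance between normalized k-mer frequency vectors is a stand-in for the similarity measures compared here (DMk is not defined in this section and is not reproduced). K-means is omitted for brevity.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Assumption: a plain k-mer frequency profile stands in for an
    alignment-free representation; it is NOT the DMk measure.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguous bases
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[w] / total for w in kmers]


def agglomerative(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns sorted lists of point indices."""
    if linkage not in ("single", "complete", "average"):
        raise ValueError("unknown linkage: " + linkage)
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":    # closest pair of members
            return min(pairwise)
        if linkage == "complete":  # farthest pair of members
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # mean over all pairs

    while len(clusters) > n_clusters:
        # Find the two closest clusters under the chosen linkage and merge them.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Invented toy data: two AT-rich and two GC-rich sequences.
seqs = ["ATATATATAT", "ATATTATAAT", "GCGCGCGCGC", "GCGGCGCCGC"]
profiles = [kmer_profile(s) for s in seqs]
for rule in ("single", "complete", "average"):
    print(rule, agglomerative(profiles, 2, rule))
```

On well-separated toy data such as this, the three linkage rules are expected to agree; differences between them tend to appear only when families overlap or vary in size, which is the situation a comparison on real gene family datasets is meant to probe.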
