Vector Quantized Spectral Clustering applied to Soybean Whole Genome Sequences

We develop a Vector Quantized Spectral Clustering (VQSC) algorithm that combines Spectral Clustering (SC) with Vector Quantization (VQ) sampling to group Soybean genomes. The motivation is to use SC for its accuracy and VQ to keep the algorithm computationally cheap (the cost of SC is cubic in the input size). Although the combination of SC and VQ is not new, the novelty of our work lies in how we construct the crucial similarity matrix in SC and in the use of k-medoids in VQ, both adapted to the Soybean genome data. We compare our approach with commonly used techniques such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbour Joining). Experimental results show that our approach significantly outperforms both techniques in terms of cluster quality (up to 25% better) and time complexity (an order of magnitude faster).
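The general VQ-then-SC idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the genome-specific similarity matrix, so a Gaussian kernel on Euclidean distances stands in for it, the spectral step follows the standard normalized formulation, and all function names, the farthest-first initialization, and the parameters `m` (number of representatives) and `sigma` (kernel width) are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    # Squared Euclidean distances between rows of A and rows of B.
    d = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.maximum(d, 0.0)

def farthest_first(X, m):
    # Deterministic greedy seeding (an assumption): start at row 0, then
    # repeatedly add the row farthest from the rows chosen so far.
    idx = [0]
    d = pairwise_sq_dists(X, X[[0]])[:, 0]
    for _ in range(1, m):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, pairwise_sq_dists(X, X[[nxt]])[:, 0])
    return np.array(idx)

def k_medoids(X, m, n_iter=20):
    # VQ step: alternate between assigning points to the nearest medoid
    # and moving each medoid to the member with the smallest total distance.
    medoids = farthest_first(X, m)
    for _ in range(n_iter):
        assign = pairwise_sq_dists(X, X[medoids]).argmin(axis=1)
        new = medoids.copy()
        for j in range(m):
            members = np.flatnonzero(assign == j)
            if members.size:
                within = pairwise_sq_dists(X[members], X[members])
                new[j] = members[within.sum(axis=0).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    assign = pairwise_sq_dists(X, X[medoids]).argmin(axis=1)
    return medoids, assign

def kmeans(U, k, n_iter=50):
    # Plain Lloyd iterations with farthest-first seeding.
    C = U[farthest_first(U, k)].copy()
    for _ in range(n_iter):
        labels = pairwise_sq_dists(U, C).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = U[labels == j].mean(axis=0)
    return pairwise_sq_dists(U, C).argmin(axis=1)

def spectral_labels(R, k, sigma=1.0):
    # SC step on the m representatives only, so the cubic eigen-solve
    # runs on an m x m matrix instead of an n x n one.
    W = np.exp(-pairwise_sq_dists(R, R) / (2 * sigma ** 2))  # stand-in similarity
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = vecs[:, -k:]             # eigenvectors of the k largest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k)

def vq_spectral_clustering(X, k, m, sigma=1.0):
    # Cluster the medoids spectrally, then give every point its medoid's label.
    medoids, assign = k_medoids(X, m)
    return spectral_labels(X[medoids], k, sigma)[assign]
```

With `n` input points and `m` representatives, the eigendecomposition in this sketch costs on the order of `m^3` rather than `n^3`, which is the source of the speed-up that motivates adding VQ to SC. Calling `vq_spectral_clustering(X, k=3, m=15, sigma=2.0)` on an `(n, d)` feature array returns one integer label per row.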
