Modeling centre-based hard and soft clustering for Y chromosome short tandem repeats (YSTR) data

This paper models (1) Y-STR data and (2) hard and soft clustering of Y-STR data. The models are extended and tested on three data sets covering Y-STR haplogroup and Y-STR Surname data. The results show that the hard and soft clustering models each have advantages and disadvantages. The soft k-Means model achieves a clustering accuracy of 99.62% on the Y-STR haplogroup data, whereas the hard k-Medoids model obtains the highest clustering accuracy, 99.90%, on the Y-STR Surname data. This suggests that both kinds of model have an equal chance of improving Y-STR clustering performance.
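The difference between hard and soft centre-based assignment can be illustrated with a minimal sketch. It is not the paper's models: the repeat counts are made-up toy values, the Manhattan distance is an assumed choice, and the helper names `distance`, `hard_assign`, and `soft_assign` are hypothetical. The soft memberships use the standard fuzzy c-means membership formula with fuzzifier `m`.

```python
# Illustrative sketch only: toy repeat counts, not data or models from the paper.

def distance(a, b):
    """Manhattan distance between two haplotypes (lists of repeat counts).
    The choice of distance is an assumption made for this example."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hard_assign(point, centres):
    """Hard clustering: return the index of the single nearest centre."""
    dists = [distance(point, c) for c in centres]
    return dists.index(min(dists))

def soft_assign(point, centres, m=2.0):
    """Soft clustering: fuzzy c-means style memberships, one per centre, summing to 1."""
    dists = [distance(point, c) for c in centres]
    # A point that coincides with one or more centres is shared among those centres only.
    if 0 in dists:
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inverse = [(1.0 / d) ** power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]

# Two hypothetical cluster centres and one sample, each with three markers.
centres = [[14, 12, 29], [16, 13, 31]]
sample = [15, 12, 30]

label = hard_assign(sample, centres)        # one cluster index
memberships = soft_assign(sample, centres)  # a degree of membership per cluster
print(label, memberships)
```

A hard model commits the sample to one cluster, while a soft model keeps a graded membership in every cluster, which is the property the two model families trade off.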
