Paper information - Modeling centre-based hard and soft clustering for Y chromosome short tandem repeats (YSTR) data


This paper models (1) Y-STR data and (2) hard and soft clustering of Y-STR data. The Y-STR models are extended and tested on three data sets covering Y-STR haplogroup and Y-STR surname data. The results show that both the hard and the soft clustering models have advantages and disadvantages. The soft k-Means model achieves a clustering accuracy of 99.62% on the Y-STR haplogroup data, whereas the hard k-Medoids model achieves the highest clustering accuracy, 99.90%, on the Y-STR surname data. This suggests that both kinds of model have an equal chance of improving Y-STR clustering performance.

Zainab Abu Bakar | Ali Seman | Azizian Mohd. Sapawi
