Subset Clustering of Binary Sequences, with an Application to Genomic Abnormality Data

Summary. This article develops a model‐based approach to clustering multivariate binary data, in which the attributes that distinguish a cluster from the rest of the population may depend on the cluster being considered. The clustering approach is based on a multivariate Dirichlet process mixture model, which allows the number of clusters, the cluster memberships, and the cluster‐specific parameters to be estimated in a unified way. Such a clustering approach has applications in the analysis of genomic abnormality data, in which the development of different types of tumors may depend on the presence of certain abnormalities at subsets of locations along the genome. Additionally, such a mixture model provides a nonparametric estimation scheme for dependent sequences of binary data.
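The idea of cluster-dependent attribute subsets can be illustrated with a small simulation. The sketch below is not the article's model or sampler; it only assumes a Chinese restaurant process prior on cluster labels (the partition induced by a Dirichlet process) and a background vector of abnormality probabilities that each cluster overrides on its own subset of positions. The function names, the elevated probability of 0.9, the background range, and `subset_size` are all hypothetical choices made for illustration.

```python
import random

def sample_crp_labels(n, alpha, rng):
    """Draw cluster labels for n items from a Chinese restaurant process."""
    labels, sizes = [], []
    for _ in range(n):
        # An existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels

def simulate_subset_clusters(n, p, alpha=1.0, subset_size=3, seed=0):
    """Simulate n binary sequences of length p with cluster-specific subsets."""
    rng = random.Random(seed)
    labels = sample_crp_labels(n, alpha, rng)
    # Assumed background probability of an abnormality at each position.
    background = [rng.uniform(0.1, 0.3) for _ in range(p)]
    n_clusters = max(labels) + 1
    # Each cluster differs from the background only on its own subset of positions.
    subsets = [sorted(rng.sample(range(p), subset_size)) for _ in range(n_clusters)]
    data = []
    for z in labels:
        probs = list(background)
        for j in subsets[z]:
            probs[j] = 0.9       # assumed elevated probability on the subset
        data.append([1 if rng.random() < q else 0 for q in probs])
    return labels, subsets, data

labels, subsets, data = simulate_subset_clusters(n=20, p=8)
```

Here `labels[i]` is the simulated cluster of sequence `i`, and `subsets[k]` lists the positions that distinguish cluster `k` from the background. Recovering both from `data` alone is the inference problem the summary describes.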
