Paper information - Model-based clustering of array CGH data

Model-based clustering of array CGH data

Motivation: Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population.

Results: We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model-based: each cluster is defined in terms of a sparse profile, which contains the locations of unusually frequent alterations. The profile is represented as a hidden Markov model. Samples are assigned to clusters based on their similarity to the cluster's profile. We simultaneously infer the cluster assignments and the cluster profiles using an expectation-maximization-like algorithm. We show, using a realistic simulation study, that our method is significantly more accurate than standard clustering techniques. We then apply our method to two clinical datasets. In particular, we examine previously reported aCGH data from a cohort of 106 follicular lymphoma patients, and discover clusters that are known to correspond to clinically relevant subgroups. In addition, we examine a cohort of 92 diffuse large B-cell lymphoma patients, and discover previously unreported clusters of biological interest, which have inspired follow-up clinical research on an independent cohort.

Availability: Software and synthetic datasets are available at http://www.cs.ubc.ca/~sshah/acgh as part of the CNA-HMMer package.

Contact: sshah@bccrc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.
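The alternating scheme summarized under Results above (assign samples to the best-matching cluster profile, then re-estimate each profile) can be illustrated with a deliberately simplified sketch. This is not the authors' method or the CNA-HMMer code. It assumes that each sample has already been reduced to discrete per-probe calls (0 = loss, 1 = neutral, 2 = gain), it replaces the sparse hidden Markov model profile with independent per-probe categorical distributions, and it uses hard assignments with a farthest-first initialization. All function names and the toy data are hypothetical.

```python
import math

STATES = (0, 1, 2)  # assumed coding: 0 = loss, 1 = neutral, 2 = gain


def hamming(a, b):
    """Number of probes at which two samples carry different calls."""
    return sum(x != y for x, y in zip(a, b))


def init_labels(samples, k):
    """Farthest-first seeding, then nearest-seed assignment (assumed heuristic)."""
    seeds = [0]
    while len(seeds) < k:
        # Pick the sample farthest from its closest existing seed.
        far = max(range(len(samples)),
                  key=lambda i: min(hamming(samples[i], samples[s]) for s in seeds))
        seeds.append(far)
    return [min(range(k), key=lambda c: hamming(s, samples[seeds[c]]))
            for s in samples]


def fit_profiles(samples, labels, k, pseudo=1.0):
    """M-like step: smoothed per-probe state frequencies for each cluster."""
    n_probes = len(samples[0])
    profiles = []
    for c in range(k):
        members = [s for s, lab in zip(samples, labels) if lab == c]
        profile = []
        for j in range(n_probes):
            counts = [pseudo] * len(STATES)  # pseudocounts avoid log(0)
            for s in members:
                counts[s[j]] += 1
            total = sum(counts)
            profile.append([n / total for n in counts])
        profiles.append(profile)
    return profiles


def log_lik(sample, profile):
    """Log-likelihood of one sample under one cluster profile."""
    return sum(math.log(profile[j][state]) for j, state in enumerate(sample))


def cluster(samples, k, max_iter=50):
    """Alternate hard assignment and profile re-estimation until stable."""
    labels = init_labels(samples, k)
    for _ in range(max_iter):
        profiles = fit_profiles(samples, labels, k)
        # E-like step: each sample moves to its most likely cluster.
        new_labels = [max(range(k), key=lambda c: log_lik(s, profiles[c]))
                      for s in samples]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, fit_profiles(samples, labels, k)


# Toy cohort: three samples with a recurrent gain at probes 2-4,
# three with a recurrent loss at probes 7-8, plus a little noise.
toy = [
    [1, 1, 2, 2, 2, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 1, 1, 1, 1, 1, 1],
    [1, 2, 2, 2, 2, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
]
labels, profiles = cluster(toy, k=2)
```

On this toy cohort the first three samples end up in one cluster and the last three in the other. The per-probe independence assumption ignores the spatial correlation between neighbouring probes that a hidden Markov model profile is designed to capture, so the sketch only conveys the shape of the alternating inference, not the model itself.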
