Model-based clustering of array CGH data

Motivation: Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population.

Results: We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model based: each cluster is defined in terms of a sparse profile, which contains the locations of unusually frequent alterations. The profile is represented as a hidden Markov model. Samples are assigned to clusters based on their similarity to each cluster's profile. We simultaneously infer the cluster assignments and the cluster profiles using an expectation-maximization-like algorithm. We show, using a realistic simulation study, that our method is significantly more accurate than standard clustering techniques. We then apply our method to two clinical datasets. In particular, we examine previously reported aCGH data from a cohort of 106 follicular lymphoma patients, and discover clusters that are known to correspond to clinically relevant subgroups. In addition, we examine a cohort of 92 diffuse large B-cell lymphoma patients, and discover previously unreported clusters of biological interest, which have inspired follow-up clinical research on an independent cohort.

Availability: Software and synthetic datasets are available at http://www.cs.ubc.ca/~sshah/acgh as part of the CNA-HMMer package.

Contact: sshah@bccrc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.
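
The alternation between assigning samples to clusters and re-estimating cluster profiles, as described in the Results, can be illustrated with a deliberately simplified sketch. This is not the CNA-HMMer implementation: it assumes the aCGH data have already been discretized into per-probe calls (0 = loss, 1 = neutral, 2 = gain), it replaces the hidden Markov model profile with independent per-probe state frequencies, and it uses hard assignments rather than whatever scheme the authors use. All function and parameter names (`fit_profiles`, `log_lik`, `cluster`, `pseudo`, `init`) are hypothetical.

```python
import math
import random

STATES = 3  # assumed discretization: 0 = loss, 1 = neutral, 2 = gain


def fit_profiles(samples, assign, k, pseudo=1.0):
    """Profile update: per-cluster, per-probe state frequencies.

    `pseudo` is an additive smoothing count so no state has probability 0.
    """
    n_probes = len(samples[0])
    counts = [[[pseudo] * STATES for _ in range(n_probes)] for _ in range(k)]
    for x, c in zip(samples, assign):
        for j, s in enumerate(x):
            counts[c][j][s] += 1.0
    return [[[v / sum(row) for v in row] for row in clus] for clus in counts]


def log_lik(x, profile):
    """Log-probability of one sample under one profile (probes independent)."""
    return sum(math.log(profile[j][s]) for j, s in enumerate(x))


def cluster(samples, k, init=None, n_iter=50, seed=0):
    """Alternate profile updates and hard reassignment until nothing changes."""
    rng = random.Random(seed)
    assign = list(init) if init is not None else [rng.randrange(k) for _ in samples]
    for _ in range(n_iter):
        profiles = fit_profiles(samples, assign, k)
        # Assignment step: each sample moves to its most similar profile.
        new_assign = [
            max(range(k), key=lambda c: log_lik(x, profiles[c])) for x in samples
        ]
        if new_assign == assign:
            break
        assign = new_assign
    return assign, fit_profiles(samples, assign, k)
```

Like other hard-assignment schemes, this sketch is sensitive to the starting assignment and can collapse all samples into one cluster from an unlucky start, so in practice one would likely run several random restarts and keep the solution with the highest total log-likelihood.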
