Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields

Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genomic hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.

Results: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which selects a small set of candidate genes most likely to be clinically relevant and to drive the recurrent amplicons of importance. Our method thus provides unbiased starting points for deciding which genomic regions, and which genes in particular, to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior to existing methods in both prediction accuracy and relevant feature discovery. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.

Contact: ogt@cs.princeton.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
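
To illustrate why an L1 penalty yields a small set of selected features, the following is a minimal sketch on a much simpler model than the one described above: plain L1-penalized logistic regression fitted by proximal gradient descent (ISTA). It is not the heterogeneous hidden conditional random field and not the authors' training algorithm. The toy data, the function names (fit_l1_logistic, make_toy_data), and the settings (lam, step, n_iter, the informative probe indices 3 and 11) are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a weight toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam=0.1, step=0.1, n_iter=300):
    """Fit L1-penalized logistic regression by proximal gradient (ISTA).

    Hypothetical helper for illustration only; no intercept term is fitted.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        # Gradient of the mean logistic loss at the current weights.
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step followed by soft-thresholding, which sets
        # weakly supported weights exactly to zero.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return w


def make_toy_data(n=200, d=20, informative=(3, 11), seed=0):
    """Assumed toy setup: d noisy probes, two of which shift with the label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for j in informative:
            row[j] += 2.0 if label == 1 else -2.0
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
w = fit_l1_logistic(X, y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("probes with nonzero weight:", selected)
```

In this sketch the soft-thresholding step is what produces exact zeros, so the probes that keep a nonzero weight form the candidate set. Increasing lam would be expected to shrink that set further, at some cost in fit.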
