Multisample aCGH Data Analysis via Total Variation and Spectral Regularization

DNA copy number variation (CNV) accounts for a large proportion of genetic variation. One commonly used approach to detecting CNVs is array-based comparative genomic hybridization (aCGH). Although many methods have been proposed to analyze aCGH data, it remains unclear how best to combine information from multiple samples to improve CNV detection. In this paper, we propose to approximate the multisample aCGH data with a matrix and to minimize both the total variation of each sample and the nuclear norm of the whole matrix. In this way, we exploit the smoothness of each sample and the correlation among multiple samples simultaneously within a convex optimization framework. We also develop an efficient and scalable algorithm to handle large-scale data. Experiments demonstrate that the proposed method outperforms state-of-the-art techniques across a wide range of scenarios and can process large data sets with millions of probes.
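
To make the two regularizers concrete, the following is a minimal sketch in Python with NumPy, not the algorithm developed in the paper. It assumes the data matrix `D` stores probes as rows and samples as columns, uses a squared Frobenius data-fit term, and introduces the hypothetical weights `lam_tv` and `lam_nuc`; none of these choices are specified in the abstract. The last function is singular value soft-thresholding, the standard proximal operator of the nuclear norm, shown only to illustrate how a nuclear-norm penalty shrinks a matrix toward low rank.

```python
import numpy as np


def total_variation(X):
    """Sum, over all samples (columns), of absolute differences
    between adjacent probes (rows)."""
    return np.abs(np.diff(X, axis=0)).sum()


def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()


def objective(D, X, lam_tv, lam_nuc):
    """Assumed objective: squared-error fit plus weighted total variation
    and nuclear norm. The weights lam_tv and lam_nuc are hypothetical."""
    fit = 0.5 * np.linalg.norm(D - X, "fro") ** 2
    return fit + lam_tv * total_variation(X) + lam_nuc * nuclear_norm(X)


def singular_value_threshold(X, tau):
    """Soft-threshold the singular values of X by tau
    (proximal operator of tau times the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt


if __name__ == "__main__":
    # Toy data: three samples sharing one piecewise-constant gain region.
    profile = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
    X = np.outer(profile, np.ones(3))
    rng = np.random.default_rng(0)
    D = X + 0.1 * rng.standard_normal(X.shape)
    print("objective at noisy data:", objective(D, D, 1.0, 1.0))
    print("objective at clean signal:", objective(D, X, 1.0, 1.0))
```

In this toy setting the clean signal is both piecewise constant within each column and rank one across columns, so both penalties are small for it, which is the intuition behind combining the two terms.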
