Heavy-Tailed Noise Suppression and Derivative Wavelet Scalogram for Detecting DNA Copy Number Aberrations

Most existing array comparative genomic hybridization (array CGH) data processing methods and evaluation models assume that the probability density function (pdf) of the noise in array CGH data is Gaussian. In practice, however, the noise distribution is peaky and heavy-tailed. A Gaussian pdf therefore approximates the noise poorly, which leads to false detections of chromosomal aberrations and to misunderstanding of disease pathogenesis. A more accurate noise model for array CGH data is needed and would benefit the detection of DNA copy number variations. We analyze real array CGH data from different platforms and show that the noise distribution is fitted very well by a generalized Gaussian distribution (GGD). Based on this new noise model, we propose a novel array CGH processing method that combines the advantages of the smoothing and segmentation approaches. The new method uses a generalized Gaussian bivariate shrinkage function and a one-directional derivative wavelet scalogram under generalized Gaussian noise. In the smoothing step, we use the new generalized Gaussian noise model to derive a heavy-tailed noise suppression algorithm in the stationary wavelet domain. In the segmentation step, the 1D Gaussian derivative wavelet scalogram is employed to detect break points. Our experiments use both real and simulated array CGH data with different types of noise (Gaussian noise, GGD noise, and real noise). We demonstrate that our new method outperforms other state-of-the-art methods in terms of both root mean squared error and receiver operating characteristic curves.
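To make the noise model concrete, the following is a minimal sketch of a zero-mean generalized Gaussian density and a simple moment-matching estimate of its shape parameter. It is an illustration only, not the fitting procedure used in the paper: the function names, the parameterization (scale `alpha`, shape `beta`), and the bisection bounds are assumptions chosen for the example. In this parameterization, `beta = 2` gives a Gaussian and `beta = 1` gives a Laplacian; smaller values of `beta` give a peakier, heavier-tailed density.

```python
import math
import random


def ggd_pdf(x, alpha, beta):
    """Zero-mean generalized Gaussian density with scale alpha and shape beta."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


def moment_ratio(beta):
    """E[x^2] / (E|x|)^2 for a zero-mean GGD; depends only on the shape beta."""
    return (math.gamma(1.0 / beta) * math.gamma(3.0 / beta)
            / math.gamma(2.0 / beta) ** 2)


def estimate_shape(samples, lo=0.1, hi=10.0, iters=60):
    """Moment-matching estimate of beta, assuming zero-mean samples.

    The ratio in moment_ratio decreases as beta grows, so bisection on
    [lo, hi] (hypothetical bounds) finds the matching shape.
    """
    n = len(samples)
    m1 = sum(abs(s) for s in samples) / n   # sample mean of |x|
    m2 = sum(s * s for s in samples) / n    # sample mean of x^2
    target = m2 / (m1 * m1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_ratio(mid) > target:
            lo = mid   # ratio too large: true beta is larger than mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-ins for noise residuals (not array CGH data).
    laplace = [random.expovariate(1.0) * random.choice((-1.0, 1.0))
               for _ in range(20000)]
    gauss = [random.gauss(0.0, 1.0) for _ in range(20000)]
    print("Laplacian samples, estimated beta:", round(estimate_shape(laplace), 2))
    print("Gaussian samples, estimated beta:", round(estimate_shape(gauss), 2))
```

On the synthetic samples, the estimate should land near 1 for the Laplacian draw and near 2 for the Gaussian draw. An estimate well below 2 on real residuals would be consistent with the peaky, heavy-tailed behavior described above.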
