Stratifying tumour subtypes based on copy number alteration profiles using next-generation sequence data

MOTIVATION The role of personalized medicine and targeted treatment in the clinical management of cancer patients has become increasingly important in recent years. This has made precise histological substratification of cancers crucial. Genomic data are increasingly seen as a valuable basis for classification. Specifically, copy number alteration (CNA) profiles generated by next-generation sequencing (NGS) can become a determinant for tumour subtyping. The principal purpose of this study is to devise a model with good prediction capability for tumour histological subtypes as a function of both the patients' covariates and their genome-wide CNA profiles from NGS data.

RESULTS We investigate a logistic regression for modelling tumour histological subtypes as a function of the patients' covariates and their CNA profiles, in a mixed model framework. The covariates, such as age and gender, are treated as fixed predictors, and the genome-wide CNA profiles are treated as random predictors. We illustrate the application of this model to lung and oral cancer datasets, and the results indicate that the tumour histological subtypes can be modelled with a good fit. Our cross-validation indicates that the logistic regression gives the best prediction among the classification methods we considered in this study. The model also shows the best agreement in prediction between smooth-segmented and circular binary-segmented CNA profiles.

AVAILABILITY AND IMPLEMENTATION An R package to run the logistic regression is available at http://www1.maths.leeds.ac.uk/~arief/R/CNALR/.

CONTACT a.gusnanto@leeds.ac.uk

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
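To make the model described under RESULTS concrete, the following is a minimal sketch of a logistic regression with unpenalised fixed covariates and ridge-penalised CNA effects, which is one common way to estimate Gaussian random effects by penalised likelihood. It is not the implementation in the CNALR package; the function name `fit_penalized_logistic`, the penalty weight `lam`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def fit_penalized_logistic(X, Z, y, lam=1.0, n_iter=50, tol=1e-8):
    """Fit logit P(y=1) = X @ beta + Z @ b with a ridge penalty on b only.

    X   : (n, p) fixed covariates, including a column of ones for the intercept.
    Z   : (n, q) CNA values, one column per genomic window.
    lam : hypothetical penalty weight, playing the role of 1 / var(b).
    """
    p, q = X.shape[1], Z.shape[1]
    W = np.hstack([X, Z])
    # Penalise only the CNA coefficients; the fixed effects stay unpenalised.
    penalty = np.diag(np.r_[np.zeros(p), np.full(q, lam)])
    theta = np.zeros(p + q)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(W @ theta)))
        grad = W.T @ (y - prob) - penalty @ theta
        weights = prob * (1.0 - prob)
        hess = W.T @ (W * weights[:, None]) + penalty
        step = np.linalg.solve(hess, grad)  # Newton-Raphson update
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta[:p], theta[p:]


# Simulated stand-in data (assumption): 80 patients, 200 genomic windows.
rng = np.random.default_rng(0)
n, q = 80, 200
age = rng.normal(size=n)                 # standardised covariate
X = np.column_stack([np.ones(n), age])   # intercept + age
Z = rng.normal(size=(n, q))              # stand-in for CNA profile values
true_b = np.zeros(q)
true_b[:10] = 0.8                        # a few windows carry signal
eta = 0.3 * age + Z @ true_b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)

beta_hat, b_hat = fit_penalized_logistic(X, Z, y, lam=5.0)
prob_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat + Z @ b_hat)))
```

Here `prob_hat` holds the fitted subtype probabilities for each patient. In practice the penalty weight would be chosen from the data, for example by cross-validation, rather than fixed in advance as it is in this sketch.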
