Integrating multiple molecular sources into a clinical risk prediction signature by extracting complementary information

BackgroundHigh-throughput technology allows for genome-wide measurements at different molecular levels for the same patient, e.g. single nucleotide polymorphisms (SNPs) and gene expression. Correspondingly, it might be beneficial to also integrate complementary information from different molecular levels when building multivariable risk prediction models for a clinical endpoint, such as treatment response or survival. Unfortunately, such a high-dimensional modeling task will often be complicated by a limited overlap of molecular measurements at different levels between patients, i.e. measurements from all molecular levels are available only for a smaller proportion of patients.ResultsWe propose a sequential strategy for building clinical risk prediction models that integrate genome-wide measurements from two molecular levels in a complementary way. To deal with partial overlap, we develop an imputation approach that allows us to use all available data. This approach is investigated in two acute myeloid leukemia applications combining gene expression with either SNP or DNA methylation data. After obtaining a sparse risk prediction signature e.g. from SNP data, an automatically selected set of prognostic SNPs, by componentwise likelihood-based boosting, imputation is performed for the corresponding linear predictor by a linking model that incorporates e.g. gene expression measurements. The imputed linear predictor is then used for adjustment when building a prognostic signature from the gene expression data. For evaluation, we consider stability, as quantified by inclusion frequencies across resampling data sets. Despite an extremely small overlap in the application example with gene expression and SNPs, several genes are seen to be more stably identified when taking the (imputed) linear predictor from the SNP data into account. In the application with gene expression and DNA methylation, prediction performance with respect to survival also indicates that the proposed approach might work well.ConclusionsWe consider imputation of linear predictor values to be a feasible and sensible approach for dealing with partial overlap in complementary integrative analysis of molecular measurements at different levels. More generally, these results indicate that a complementary strategy for integrating different molecular levels can result in more stable risk prediction signatures, potentially providing a more reliable insight into the underlying biology.

[1]  G. Brier VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY , 1950 .

[2]  Fernando F. Costa,et al.  Abnormal Expression of Ndfip2 and Cbl in Acute Myeloid Leukemia and Myelodysplastic Syndrome Patients: Role of Ubiquitin Proteasome System in Myeloid Neoplasms and Normal Hematopoiesis , 2011 .

[3]  Anne-Laure Boulesteix,et al.  Stability Investigations of Multivariable Regression Models Derived from Low- and High-Dimensional Data , 2011, Journal of biopharmaceutical statistics.

[4]  Michael A Dolan,et al.  Structural variants of IFNα preferentially promote antiviral functions. , 2011, Blood.

[5]  Anne-Laure Boulesteix,et al.  On stability issues in deriving multivariable regression models , 2015, Biometrical journal. Biometrische Zeitschrift.

[6]  Zoubin Ghahramani,et al.  Bayesian correlated clustering to integrate multiple datasets , 2012, Bioinform..

[7]  David R. Cox,et al.  Regression models and life tables (with discussion , 1972 .

[8]  K Holzmann,et al.  Identification of acquired copy number alterations and uniparental disomies in cytogenetically normal acute myeloid leukemia using high-resolution single-nucleotide polymorphism analysis , 2010, Leukemia.

[9]  Holger Schwender,et al.  Testing SNPs and sets of SNPs for importance in association studies. , 2011, Biostatistics.

[10]  Adam A. Margolin,et al.  Addendum: The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity , 2012, Nature.

[11]  Xiao Zhang,et al.  Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis , 2010, BMC Bioinformatics.

[12]  Schumacher Martin,et al.  Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples , 2008 .

[13]  Claudio Lottaz,et al.  Gene-expression profiling identifies distinct subclasses of core binding factor acute myeloid leukemia , 2007 .

[14]  R. Tibshirani,et al.  Improvements on Cross-Validation: The 632+ Bootstrap Method , 1997 .

[15]  Tso-Jung Yen,et al.  Discussion on "Stability Selection" by Meinshausen and Buhlmann , 2010 .

[16]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[17]  Tom F. Wilderjans,et al.  A flexible framework for sparse simultaneous component based data integration , 2011, BMC Bioinformatics.

[18]  R. Tibshirani,et al.  A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. , 2009, Biostatistics.

[19]  Harald Binder,et al.  Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models , 2008, BMC Bioinformatics.

[20]  Gerhard Tutz,et al.  Boosting ridge regression , 2007, Comput. Stat. Data Anal..

[21]  Harald Binder,et al.  Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures , 2014, PloS one.

[22]  Harald Binder,et al.  Assessment of survival prediction models based on microarray data , 2007, Bioinform..

[23]  Thomas A Gerds,et al.  Efron‐Type Measures of Prediction Error for Survival Analysis , 2007, Biometrics.

[24]  R. Tibshirani,et al.  Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. , 2004, The New England journal of medicine.

[25]  N. Meinshausen,et al.  Stability selection , 2008, 0809.2932.

[26]  W. Huber,et al.  Differential expression analysis for sequence count data , 2010 .

[27]  Fatima Al-Shahrour,et al.  Musashi-2 regulates normal hematopoiesis and promotes aggressive myeloid leukemia , 2010, Nature Medicine.