stepwiseCM: An R Package for Stepwise Classification of Cancer Samples Using Multiple Heterogeneous Data Sets

This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples efficiently using two heterogeneous data sets. The algorithm captures the distinct classification power of the two data types without actually combining them. The package is suited to classification problems in which two different types of data are available on the same samples: one type is measured on all samples, the other on only some of them. The former is easy to collect and/or relatively cheap (e.g., clinical covariates) compared with the latter (high-dimensional data, e.g., gene expression). An additional application for which stepwiseCM has also proven useful is the combination of two high-dimensional data types, e.g., DNA copy number and mRNA expression. The package includes functions that project the neighborhood information in one data space onto the other in order to identify the group of samples likely to benefit most from measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and existing packages is that our approach aims to be cost-efficient by avoiding the measurement of additional covariates, which may be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, test results for these individuals would be available quickly at diagnosis, which may reduce waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages and facilitates the use of the stepwiseCM package in diverse applications.
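The stepwise idea described above can be illustrated with a minimal Python sketch. It is not the stepwiseCM API and not necessarily the package's algorithm: the function names (`k_nearest`, `needs_second_stage`), the Euclidean neighborhood, the value of `k`, and the toy data are all assumptions made for illustration. For a new sample, the sketch looks at its nearest training neighbors in the cheap (clinical) data space and recommends the expensive measurement only if the second-stage classifier was more accurate than the first-stage one on those neighbors.

```python
import math

def k_nearest(query, points, k):
    """Indices of the k training points closest to `query` (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]

def needs_second_stage(query_clinical, train_clinical,
                       clinical_correct, molecular_correct, k=3):
    """Hypothetical rule: measure the second data type only if it helped the neighbors.

    clinical_correct[i] and molecular_correct[i] record whether each classifier
    labeled training sample i correctly (for example under cross-validation).
    """
    nbrs = k_nearest(query_clinical, train_clinical, k)
    clin_acc = sum(clinical_correct[i] for i in nbrs) / len(nbrs)
    mol_acc = sum(molecular_correct[i] for i in nbrs) / len(nbrs)
    return mol_acc > clin_acc

# Toy training set (invented values): two clusters in a 2-D clinical space.
train_clinical = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                  (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clinical_correct = [True, True, True, False, False, True]
molecular_correct = [True, True, False, True, True, True]

# Near the first cluster the clinical classifier was reliable: stop at stage one.
easy = needs_second_stage((0.1, 0.1), train_clinical,
                          clinical_correct, molecular_correct)
# Near the second cluster it was not: pass the sample on to stage two.
hard = needs_second_stage((5.0, 5.1), train_clinical,
                          clinical_correct, molecular_correct)
print(easy, hard)  # False True
```

In this toy setting only the second query would be sent for the additional measurement, which mirrors the cost-saving motivation: samples whose neighborhood is already well classified by the cheap data are spared the second test.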
