OPTIMIZED CROSS-STUDY ANALYSIS OF MICROARRAY-BASED PREDICTORS

Background: Microarray-based gene expression analysis is widely used in cancer research to discover molecular signatures for cancer classification and prediction. In addition to numerous independent profiling projects, a number of investigators have analyzed multiple published data sets for purposes of cross-study validation. However, the diversity of microarray platforms and technical approaches makes direct comparisons across studies difficult and, without a means of identifying aberrant data patterns, less than optimal. To address this issue, we previously developed an integrative correlation approach to systematically assess the agreement of gene expression measurements across studies, providing a basis for cross-study validation analysis. Here we generalize this methodology to provide a metric for evaluating the overall efficacy of preprocessing and cross-referencing, and we explore optimal combinations of filtering and cross-referencing strategies. We operate in the context of validating prognostic breast cancer gene expression signatures on data reported by three different groups, each using a different platform.

Results: To evaluate overall cross-platform reproducibility in the context of a specific prediction problem, we suggest integrative association, that is, the cross-study correlation of gene-specific measures of association with the predicted phenotype. Specifically, in this paper we use the correlation among the Cox proportional