Effect of data combination on predictive modeling: a study using gene expression data.

BACKGROUND: The quality of predictive modeling in biomedicine depends on the amount of data available for model building.

OBJECTIVE: To study the effect of combining microarray data sets on feature selection and predictive modeling performance.

METHODS: Empirical evaluation of the stability of feature selection and the discriminatory power of classifiers using three previously published gene expression data sets, analyzed both individually and in combination.

RESULTS: Feature selection was not robust for either the individual or the combined data sets. The classification performance of models built on individual and combined data sets depended heavily on the data set from which the features were extracted.

CONCLUSION: We identified volatility of feature selection as a contributing factor to some of the problems faced by predictive modeling using microarray data.
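One way to make "stability of feature selection" concrete is to repeat the selection on resampled versions of a data set and measure how much the selected gene lists overlap. The sketch below is illustrative only and is not the procedure used in the study: the ranking criterion (absolute Welch t-statistic), the stratified bootstrap, the Jaccard overlap measure, the synthetic data generator, and all function and parameter names are assumptions introduced here.

```python
import numpy as np


def top_k_features(X, y, k):
    """Return the indices of the k features with the largest absolute
    Welch t-statistic between class 0 and class 1 (assumed criterion)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return set(np.argsort(t)[-k:].tolist())


def jaccard(s, t):
    """Jaccard overlap |s & t| / |s | t| of two feature sets."""
    return len(s & t) / len(s | t)


def selection_stability(X, y, k=10, n_resamples=30, seed=0):
    """Mean pairwise Jaccard overlap of top-k feature sets selected on
    stratified bootstrap resamples of (X, y). Ranges from 0 to 1."""
    rng = np.random.default_rng(seed)
    class_idx = [np.flatnonzero(y == c) for c in (0, 1)]
    sets = []
    for _ in range(n_resamples):
        # Resample within each class so both classes stay represented.
        idx = np.concatenate(
            [rng.choice(ci, size=len(ci), replace=True) for ci in class_idx]
        )
        sets.append(top_k_features(X[idx], y[idx], k))
    overlaps = [
        jaccard(sets[i], sets[j])
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
    ]
    return float(np.mean(overlaps))


def make_synthetic(n_per_class=30, n_features=500, n_informative=10,
                   effect=1.0, seed=0):
    """Hypothetical stand-in for an expression matrix: Gaussian noise with
    a mean shift of size `effect` in the first n_informative features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_class, n_features))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :n_informative] += effect
    return X, y


# Usage: compare stability under a strong signal and under no signal.
X_strong, y_strong = make_synthetic(effect=3.0)
X_null, y_null = make_synthetic(effect=0.0)
print("strong signal:", selection_stability(X_strong, y_strong))
print("no signal:    ", selection_stability(X_null, y_null))
```

A value near 1 means the same features are chosen regardless of which samples are drawn; a low value corresponds to the kind of volatility the abstract reports. The same function could in principle be applied to each data set separately and to their combination, although combining real microarray data sets would also require cross-platform normalization that this sketch does not address.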
