Integration of High-Volume Molecular and Imaging Data for Composite Biomarker Discovery in the Study of Melanoma

This work studies the effect of simple imputation schemes on the integration of multimodal data originating from different patients. Two separate cutaneous melanoma datasets are used: an image analysis (dermoscopy) dataset and a transcriptomic (DNA microarray) dataset. Each modality covers a different set of patients, and four imputation methods are employed to form a unified, integrative dataset. Backward feature selection with an ensemble classifier (random forests), followed by principal component analysis and linear discriminant analysis, illustrates how the imputations affect feature selection and dimensionality reduction. The results suggest that expanding the feature space through data integration, achieved here through imputation, aids the classification task and stabilizes the derivation of putative classifiers. In particular, although the biased imputation methods significantly increase the predictive performance and class discrimination of the datasets, they still contribute to the study of prominent features and their relations. Fusing separate datasets that provide a multimodal description of the same pathology is a promising avenue for deriving robust composite biomarkers and for interpreting the biomedical problem studied.
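
To make the integration step concrete, here is a minimal sketch of how two modalities measured on different patients might be merged by imputation. The abstract does not name the four imputation methods, so the two variants shown (global column means, and class-conditional column means as one possible "biased" scheme) are assumptions for illustration only. The function names, the toy values, and the `class_aware` flag are all hypothetical.

```python
from statistics import mean


def column_means(rows):
    """Mean of each feature column over a list of equal-length rows."""
    return [mean(col) for col in zip(*rows)]


def integrate(imaging, molecular, class_aware=False):
    """Build one table from two modalities measured on different patients.

    imaging / molecular: lists of (label, feature_vector) pairs.
    Every patient has only one modality, so the other block is imputed
    with column means, taken over all patients or, if class_aware is
    True, only over patients sharing the same label (assumed scheme).
    """
    def fill_values(records, label):
        pool = [x for _, x in records]
        if class_aware:
            same_class = [x for y, x in records if y == label]
            # Fall back to the global pool if the class is absent.
            pool = same_class or pool
        return column_means(pool)

    unified = []
    for label, x in imaging:      # imaging patients: impute molecular block
        unified.append((label, list(x) + fill_values(molecular, label)))
    for label, x in molecular:    # molecular patients: impute imaging block
        unified.append((label, fill_values(imaging, label) + list(x)))
    return unified


# Toy data (hypothetical values): two imaging features, one expression value.
imaging = [("melanoma", [0.75, 0.5]), ("nevus", [0.25, 0.0]), ("melanoma", [0.5, 1.0])]
molecular = [("melanoma", [5.0]), ("nevus", [1.0]), ("nevus", [3.0])]

unified_global = integrate(imaging, molecular)
unified_class = integrate(imaging, molecular, class_aware=True)
```

In this sketch the class-aware variant copies label information into the imputed features. That is one plausible reason an imputation scheme of this kind could raise apparent predictive performance and class separation, which is why such a scheme would be described as biased.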
