Information Content and Analysis Methods for Multi-Modal High-Throughput Biomedical Data

The spectrum of modern molecular high-throughput assaying includes diverse technologies such as microarray gene expression, miRNA expression, proteomics, and DNA methylation, among many others. Now that these technologies have matured and become increasingly accessible, the next frontier is to collect “multi-modal” data for the same set of subjects and conduct integrative, multi-level analyses. While multi-modal data does contain distinct biological information that can be useful for answering complex biological questions, its value for predicting clinical phenotypes and the contribution of each type of input remain unknown. We obtained 47 datasets/predictive tasks that together span 9 data modalities and ran analytic experiments to predict various clinical phenotypes and outcomes. First, we analyzed each modality separately using uni-modal approaches based on several state-of-the-art supervised classification and feature selection methods. Then, we applied integrative multi-modal classification techniques. We found that gene expression is the most predictively informative modality. Other modalities, such as protein expression, miRNA expression, and DNA methylation, also yield highly predictive results, which are often statistically comparable, but not superior, to those obtained from gene expression data. Integrative multi-modal analyses generally do not increase predictive signal compared to gene expression data alone.
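The two-stage protocol described above (score each modality on its own, then score an integrated representation) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the data are synthetic, the nearest-centroid scorer is a toy stand-in for the supervised classifiers used in the study, feature concatenation is only one simple form of multi-modal integration, and the helper names (`make_modality`, `cv_auc`, and so on) are hypothetical. Performance is summarized as cross-validated AUC, computed from the Mann-Whitney U statistic.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroids(X, y):
    """Toy stand-in classifier: one centroid per class (not the study's method)."""
    c0 = centroid([r for r, label in zip(X, y) if label == 0])
    c1 = centroid([r for r, label in zip(X, y) if label == 1])
    return c0, c1

def score(x, c0, c1):
    """Larger values mean x lies closer to the class-1 centroid than to class 0."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return d0 - d1

def cv_auc(X, y, folds=5):
    """Mean held-out AUC over interleaved folds (sample i goes to fold i % folds)."""
    fold_aucs = []
    for k in range(folds):
        train = [i for i in range(len(y)) if i % folds != k]
        test = [i for i in range(len(y)) if i % folds == k]
        c0, c1 = fit_centroids([X[i] for i in train], [y[i] for i in train])
        scores = [score(X[i], c0, c1) for i in test]
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / folds

def make_modality(y, n_features, shift, rng):
    """Synthetic modality: only feature 0 carries class signal of size `shift`."""
    return [[rng.gauss(shift * label if j == 0 else 0.0, 1.0)
             for j in range(n_features)] for label in y]

if __name__ == "__main__":
    rng = random.Random(0)
    y = [i % 2 for i in range(100)]                # balanced binary phenotype
    strong = make_modality(y, 10, 1.5, rng)        # hypothetical "informative" modality
    weak = make_modality(y, 10, 0.5, rng)          # hypothetical "less informative" modality
    combined = [a + b for a, b in zip(strong, weak)]  # integration by concatenation
    for name, X in [("strong", strong), ("weak", weak), ("combined", combined)]:
        print(name, round(cv_auc(X, y), 3))
```

Comparing the three printed AUCs mirrors the question the study asks: whether the integrated representation adds predictive signal beyond the best single modality. The synthetic numbers carry no evidential weight about real biomedical data.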
