Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data

Background: Most ‘transcriptomic’ data from microarrays are generated from small sample sizes compared to the large number of measured biomarkers, making it very difficult to build accurate and generalizable disease state classification models. Integrating information from different, but related, ‘transcriptomic’ datasets may help build better classification models. However, most proposed methods for integrative analysis of ‘transcriptomic’ data cannot incorporate domain knowledge, which can improve model performance. To this end, we have developed a methodology that leverages transfer rule learning and functional modules, which we call TRL-FM, to capture and abstract domain knowledge in the form of classification rules to facilitate integrative modeling of multiple gene expression datasets. TRL-FM is an extension of the transfer rule learner (TRL) that we developed previously. The goal of this study was to test our hypothesis that “an integrative model obtained via the TRL-FM approach outperforms traditional models based on single gene expression data sources”.

Results: To evaluate the feasibility of the TRL-FM framework, we compared the area under the ROC curve (AUC) of models developed with TRL-FM and other traditional methods, using 21 microarray datasets generated from three studies on brain cancer, prostate cancer, and lung disease, respectively. The results show that TRL-FM statistically significantly outperforms TRL as well as traditional models based on single source data. In addition, TRL-FM performed better than other integrative models driven by meta-analysis and cross-platform data merging.

Conclusions: The capability of utilizing transferred abstract knowledge derived from source data using feature mapping enables the TRL-FM framework to mimic the human process of learning and adaptation when performing related tasks. The novel TRL-FM methodology for integrative modeling of multiple ‘transcriptomic’ datasets is able to intelligently incorporate domain knowledge that traditional methods might disregard, to boost predictive power and generalization performance. In this study, TRL-FM’s abstraction of knowledge is achieved in the form of functional modules, but the overall framework is generalizable in that different approaches to acquiring abstract knowledge can be integrated into this framework.
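
To make the idea of abstracting rules through functional modules more concrete, the following is a minimal sketch and not the TRL-FM implementation. It assumes a rule is a list of (feature, level) conditions plus a class label, and that a gene-to-module lookup is available. The names `GENE_TO_MODULE`, `abstract_rule`, and `instantiate_rule`, as well as the gene and module entries, are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical gene -> functional module lookup (illustrative only,
# not taken from the paper or from any specific database).
GENE_TO_MODULE: Dict[str, str] = {
    "TP53": "apoptosis",
    "BAX": "apoptosis",
    "CCND1": "cell_cycle",
    "CDK4": "cell_cycle",
}

# A rule is (antecedent conditions, predicted class label).
Condition = Tuple[str, str]          # (feature name, discretized level)
Rule = Tuple[List[Condition], str]


def abstract_rule(rule: Rule, gene_to_module: Dict[str, str]) -> Rule:
    """Lift a gene-level rule to module level; unmapped genes are kept as-is."""
    conditions, label = rule
    lifted = [(gene_to_module.get(gene, gene), level) for gene, level in conditions]
    return lifted, label


def instantiate_rule(
    rule: Rule, gene_to_module: Dict[str, str], target_genes: List[str]
) -> List[Rule]:
    """Expand a module-level rule into candidate gene-level rules for a target dataset."""
    conditions, label = rule
    options = []
    for module, level in conditions:
        # Genes measured in the target dataset that belong to this module.
        genes = [g for g in target_genes if gene_to_module.get(g) == module]
        options.append([(g, level) for g in genes])
    # One candidate rule per combination; empty if any module has no target gene.
    return [(list(combo), label) for combo in product(*options)]


source_rule: Rule = ([("TP53", "high")], "tumor")
module_rule = abstract_rule(source_rule, GENE_TO_MODULE)
candidates = instantiate_rule(module_rule, GENE_TO_MODULE, ["BAX", "CDK4"])
```

In this sketch, a rule learned on a source dataset that measured TP53 can still seed a candidate rule on a target dataset that only measured BAX, because both genes map to the same (assumed) module. How such candidates would be scored or pruned on the target data is outside the scope of the sketch.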
