Bayesian rule learning for biomedical data mining

MOTIVATION: Disease state prediction from biomarker profiling studies is an important problem because more accurate classification models will potentially lead to the discovery of better, more discriminative markers. Data mining methods are routinely applied to analyze biomedical datasets generated by high-throughput 'omic' technologies from clinical samples of tissues or bodily fluids. Past work has demonstrated that rule models can be applied successfully to this problem, since they produce understandable models that make it easier for biomedical scientists to review the discriminative biomarkers. While many rule-based methods produce rules that make predictions under uncertainty, they typically do not quantify the uncertainty in the validity of the rule itself. This article describes an approach that uses a Bayesian score to evaluate rule models.

RESULTS: We have combined the expressiveness of rules with the mathematical rigor of Bayesian networks (BNs) to develop and evaluate a Bayesian rule learning (BRL) system. This system uses a novel variant of the K2 algorithm, guided by heuristic best-first search, to build BNs from the training data and to provide probabilistic scores for IF-antecedent-THEN-consequent rules. We then apply rule-based inference to evaluate the learned models using two runs of 10-fold cross-validation. The BRL system is evaluated on 24 published 'omic' datasets, and on average it performs on par with or better than other readily available rule learning methods. Moreover, BRL produces models that contain on average 70% fewer variables, which means that the biomarker panels for disease prediction contain fewer markers for further verification and validation by bench scientists.
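To make the idea of a Bayesian score for rules concrete, the following is a minimal sketch, not the BRL system itself. It assumes discretized markers, uses the standard Cooper-Herskovits (K2) marginal likelihood with uniform Dirichlet priors rather than the paper's novel variant, and omits the best-first search. The function names, the toy data, and the marker names `m1` and `m2` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def count_configs(rows, parents, target):
    """Count target values for each observed configuration of the parent markers."""
    counts = defaultdict(Counter)
    for row in rows:
        config = tuple(row[p] for p in parents)
        counts[config][row[target]] += 1
    return counts


def log_k2_score(rows, parents, target, target_values):
    """Log K2 marginal likelihood of `target` given `parents` (uniform priors).

    For each parent configuration j with N_j samples and counts N_jk:
        (r - 1)! / (N_j + r - 1)! * prod_k N_jk!
    where r is the number of target values. Factorials are computed via lgamma.
    """
    r = len(target_values)
    score = 0.0
    for config_counts in count_configs(rows, parents, target).values():
        n_j = sum(config_counts.values())
        score += math.lgamma(r) - math.lgamma(n_j + r)
        score += sum(math.lgamma(n + 1) for n in config_counts.values())
    return score


def extract_rules(rows, parents, target, target_values):
    """Turn each parent configuration into an IF-THEN rule with a posterior probability.

    The probability is the posterior mean under a uniform Dirichlet prior:
        (N_jk + 1) / (N_j + r)
    """
    r = len(target_values)
    rules = []
    for config, config_counts in count_configs(rows, parents, target).items():
        n_j = sum(config_counts.values())
        best = max(target_values, key=lambda v: config_counts[v])
        prob = (config_counts[best] + 1) / (n_j + r)
        antecedent = dict(zip(parents, config))
        rules.append((antecedent, best, prob))
    return rules


# Hypothetical toy data: m1 separates the classes, m2 is uninformative.
ROWS = [
    {"m1": "high", "m2": "a", "cls": "disease"},
    {"m1": "high", "m2": "b", "cls": "disease"},
    {"m1": "high", "m2": "a", "cls": "disease"},
    {"m1": "high", "m2": "b", "cls": "disease"},
    {"m1": "low", "m2": "a", "cls": "healthy"},
    {"m1": "low", "m2": "b", "cls": "healthy"},
    {"m1": "low", "m2": "a", "cls": "healthy"},
    {"m1": "low", "m2": "b", "cls": "healthy"},
]
CLASSES = ["disease", "healthy"]

if __name__ == "__main__":
    for parents in ([], ["m1"], ["m2"]):
        print(parents, log_k2_score(ROWS, parents, "cls", CLASSES))
    for antecedent, consequent, prob in extract_rules(ROWS, ["m1"], "cls", CLASSES):
        print("IF", antecedent, "THEN cls =", consequent, f"(p = {prob:.3f})")
```

In this sketch the structure with `m1` as the only parent scores higher than either the empty structure or the one using `m2`, so a search over candidate parent sets would favor the smaller, more discriminative model. This is consistent in spirit with the abstract's observation that the learned models contain fewer variables, though the toy example says nothing about the reported results.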
