Fusing Domain Knowledge with Data: Applications in Bioinformatics

Objective: Successful use of classifiers that learn to make decisions from a set of patient examples requires robust methods for performance estimation. Many promising approaches for determining an upper bound on the error rate of a single classifier have recently been reported, but the Bayesian credibility interval (CI) obtained from a conventional holdout test still delivers one of the tightest bounds. However, the conventional Bayesian CI becomes unacceptably large in real-world applications where the test set contains fewer than a few hundred examples. The source of this problem is the fact that the CI is determined exclusively by the result on the test examples: the uniform prior density employed, which reflects a complete lack of prior knowledge about the unknown error rate, provides no information at all. The aim of the study reported here was therefore to investigate a maximum entropy (ME) based approach to improved prior knowledge and Bayesian CIs, and to demonstrate its relevance for biomedical research and clinical practice.

Method and material: It is demonstrated how a refined non-uniform prior density can be obtained by means of the ME principle, using empirical results from a few designs and tests on non-overlapping sets of examples.

Results: Experimental results show that ME-based priors improve the CIs when applied to four quite different simulated and two real-world data sets.

Conclusions: An empirically derived ME prior seems promising for improving the Bayesian CI for the unknown error rate of a designed classifier.
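
The effect of the prior on the bound can be illustrated with a minimal sketch. It is not the study's implementation. It assumes a binomial holdout likelihood and, as one simple instance of the ME principle, a prior constrained only by the mean error rate observed in a few earlier design-and-test runs; under that single constraint the ME density on [0, 1] is a truncated exponential. The constraints actually used in the study are not stated in the abstract, and the names `upper_bound`, `me_rate`, and the pilot error rates below are hypothetical.

```python
import math


def log_likelihood(e, errors, n_test):
    """Binomial log-likelihood of `errors` mistakes on `n_test` holdout examples."""
    if e <= 0.0:
        return 0.0 if errors == 0 else -math.inf
    if e >= 1.0:
        return 0.0 if errors == n_test else -math.inf
    return errors * math.log(e) + (n_test - errors) * math.log(1.0 - e)


def upper_bound(errors, n_test, prior=lambda e: 1.0, level=0.95, grid=20001):
    """One-sided credibility bound b with P(error rate <= b | data) = level.

    The posterior (likelihood times prior) is integrated numerically on a
    grid, so any non-negative prior density on [0, 1] can be plugged in.
    The default prior is uniform, as in the conventional holdout analysis.
    """
    h = 1.0 / (grid - 1)
    xs = [i * h for i in range(grid)]
    logs = [log_likelihood(e, errors, n_test) for e in xs]
    peak = max(logs)  # subtract the peak to avoid numerical underflow
    dens = [math.exp(v - peak) * prior(e) for v, e in zip(logs, xs)]
    cum = [0.0]  # cumulative trapezoid integral of the unnormalised posterior
    for a, b in zip(dens, dens[1:]):
        cum.append(cum[-1] + 0.5 * (a + b) * h)
    target = level * cum[-1]
    for e, c in zip(xs, cum):
        if c >= target:
            return e
    return 1.0


def me_rate(mean_error):
    """Rate lam of the ME prior exp(-lam * e) on [0, 1] with the given mean.

    Assumption: the only constraint is the mean error rate, which must lie
    below 0.5 for a decaying exponential. Solved by bisection.
    """
    if not 1e-4 < mean_error < 0.5:
        raise ValueError("mean_error must lie in (1e-4, 0.5)")

    def prior_mean(lam):
        # Mean of the truncated exponential: 1/lam - exp(-lam)/(1 - exp(-lam))
        return 1.0 / lam + math.exp(-lam) / math.expm1(-lam)

    lo, hi = 1e-6, 1e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if prior_mean(mid) > mean_error:
            lo = mid  # mean still too high: a larger rate is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical error rates from a few earlier designs and tests
# on non-overlapping example sets.
pilot_error_rates = [0.08, 0.12, 0.10]
lam = me_rate(sum(pilot_error_rates) / len(pilot_error_rates))

uniform_bound = upper_bound(3, 40)  # 3 errors on 40 test examples
me_bound = upper_bound(3, 40, prior=lambda e: math.exp(-lam * e))
print(uniform_bound, me_bound)
```

Because this prior decreases monotonically in the error rate, it shifts posterior mass toward lower error rates, so the resulting bound is tighter than the one obtained with the uniform prior on the same test result. Whether such a prior is justified depends entirely on how representative the earlier runs are.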
