Predictive Power Estimation Algorithm (PPEA) - A New Algorithm to Reduce Overfitting for Genomic Biomarker Discovery

Toxicogenomics promises to aid in predicting adverse effects, understanding the mechanisms of drug action or toxicity, and uncovering unexpected or secondary pharmacology. However, modeling adverse effects with high-dimensional, noisy genomic data is prone to overfitting. Models constructed from such data sets often consist of a large number of genes with no obvious functional relevance to the biological effect the model is intended to predict, which makes the modeling results difficult to interpret. To address these issues, we developed a novel algorithm, the Predictive Power Estimation Algorithm (PPEA), which estimates the predictive power of each individual transcript through an iterative two-way bootstrapping procedure. By enforcing, in each iteration of modeling and testing, that the number of samples is larger than the number of transcripts, PPEA reduces the risk of overfitting. We show with three different case studies that: (1) PPEA can quickly derive a reliable rank order of the predictive power of individual transcripts in a relatively small number of iterations; (2) the top-ranked transcripts tend to be functionally related to the phenotype they are intended to predict; (3) using only the most predictive, top-ranked transcripts greatly facilitates the development of multiplex assays, such as qRT-PCR, as biomarkers; and (4) most importantly, a small number of genes identified from the top-ranked transcripts are highly predictive of phenotype, as their expression changes distinguished adverse from nonadverse effects of compounds in completely independent tests. Thus, we believe that PPEA effectively addresses the overfitting problem and can be used to facilitate genomic biomarker discovery for predictive toxicology and drug responses.
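The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "two-way bootstrapping" means resampling both samples and transcripts in each iteration, that a nearest-centroid classifier stands in for the unspecified model, and that a transcript's predictive power is the mean out-of-bag accuracy of the iterations in which it was drawn. The names `estimate_predictive_power`, `nearest_centroid_accuracy`, `n_iter`, and `subset_size` are hypothetical, and the data are synthetic.

```python
import random


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Fit class centroids on the chosen columns and return test accuracy."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    correct = 0
    for x, lab in zip(X_test, y_test):
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda k: sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[k])),
        )
        correct += pred == lab
    return correct / len(X_test)


def estimate_predictive_power(X, y, n_iter=500, subset_size=5, seed=0):
    """Score each transcript by mean out-of-bag accuracy (assumed scheme)."""
    rng = random.Random(seed)
    n_samples, n_transcripts = len(X), len(X[0])
    # Keep the transcript count below the sample count in every iteration.
    if subset_size >= n_samples:
        raise ValueError("subset_size must be smaller than the sample count")
    totals = [0.0] * n_transcripts
    counts = [0] * n_transcripts
    for _ in range(n_iter):
        # Resample transcripts (without replacement) and samples (with replacement).
        cols = rng.sample(range(n_transcripts), subset_size)
        boot = [rng.randrange(n_samples) for _ in range(n_samples)]
        in_bag = set(boot)
        oob = [i for i in range(n_samples) if i not in in_bag]
        y_train = [y[i] for i in boot]
        if not oob or len(set(y_train)) < 2:
            continue  # nothing to test on, or only one class was drawn
        acc = nearest_centroid_accuracy(
            [X[i] for i in boot], y_train,
            [X[i] for i in oob], [y[i] for i in oob],
            cols,
        )
        # Credit every transcript used in this iteration with its accuracy.
        for c in cols:
            totals[c] += acc
            counts[c] += 1
    return [totals[c] / counts[c] if counts[c] else 0.0
            for c in range(n_transcripts)]


# Synthetic example: 40 samples, 20 transcripts, only transcript 0 is informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(20)]
     for label in y]
scores = estimate_predictive_power(X, y, n_iter=300, subset_size=5)
ranking = sorted(range(20), key=lambda j: -scores[j])
```

In this sketch `ranking` orders transcripts from most to least predictive. Drawing only a few transcripts per iteration is what keeps each fitted model smaller than the number of available samples, which is the constraint the abstract describes.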
