Speeding up the discovery of combinations of differentially expressed genes for disease prediction and classification

BACKGROUND AND OBJECTIVE Finding combinations (i.e., pairs or, more generally, q-tuples with q ≥ 2) of genes whose behavior as a group differs significantly between two classes has received considerable attention in the search for simple, accurate, and easily interpretable decision rules for disease classification and prediction. For example, the Top Scoring Pair (TSP) method seeks pairs of genes that maximize the probability that the relative ranking of their expression levels is reversed between the two classes. The computational cost of finding a q-tuple of genes that scores highest under a given metric is O(G^q), where G is the total number of genes. This cost is often problematic or prohibitive in practice (even for q = 2), as G is often on the order of tens of thousands.
METHODS In this paper, we show that this computational cost can be significantly reduced by excluding from consideration genes whose behavior is almost identical in the two classes and whose inclusion in any q-tuple is therefore rather non-informative. Our exclusion criterion is based on a statistically robust metric, the Area Under the Curve (AUC) of the corresponding Receiver Operating Characteristic (ROC) curve. By filtering out genes whose AUC value is below a user-chosen threshold, determined by a procedure that we describe in the paper, we obtain dramatic reductions in run time while maintaining the same classification accuracy.
RESULTS We have experimentally verified the gains of this approach on several case studies involving ovarian, colon, leukemia, breast, and prostate cancers, and diffuse large B-cell lymphoma.
CONCLUSIONS The proposed method is not only faster (for example, we observed an average 78.65% reduction in run time relative to TSP) while maintaining the same classification accuracy, but it can even yield better classification accuracy, owing to its inherent ability to avoid the so-called "pivot" (non-informative) genes that may otherwise intrude into the chosen q-tuples.
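
As an illustration of the idea summarized above, the following is a minimal Python sketch of AUC-based gene filtering followed by an exhaustive pair search for q = 2. It is not the authors' implementation. The function names (`gene_auc`, `auc_filter`, `tsp_score`, `top_scoring_pair`), the toy data, and the threshold of 0.8 are all hypothetical. The sketch also assumes that a gene's AUC is "folded" as max(AUC, 1 - AUC), so that a gene separating the classes in either direction counts as informative, and that the pair score is the absolute difference between the two classes in the observed frequency of x_i < x_j; the paper's own threshold-selection procedure is not reproduced here.

```python
from itertools import combinations


def gene_auc(values, labels):
    """AUC of one gene as a class-1-vs-class-0 separator (Mann-Whitney form)."""
    pos = [v for v, c in zip(values, labels) if c == 1]
    neg = [v for v, c in zip(values, labels) if c == 0]
    # Each (positive, negative) pair counts 1 if ordered correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_filter(X, labels, threshold):
    """Indices of genes whose folded AUC, max(AUC, 1 - AUC), reaches the threshold.

    Folding is an assumption of this sketch, not something stated in the abstract.
    """
    kept = []
    for g in range(len(X[0])):
        auc = gene_auc([row[g] for row in X], labels)
        if max(auc, 1.0 - auc) >= threshold:
            kept.append(g)
    return kept


def tsp_score(X, labels, i, j):
    """|freq(x_i < x_j | class 0) - freq(x_i < x_j | class 1)| over the samples."""
    freq = []
    for c in (0, 1):
        rows = [row for row, lab in zip(X, labels) if lab == c]
        freq.append(sum(row[i] < row[j] for row in rows) / len(rows))
    return abs(freq[0] - freq[1])


def top_scoring_pair(X, labels, genes):
    """Exhaustive search over all pairs from `genes`; cost grows as len(genes)**2."""
    return max(combinations(genes, 2), key=lambda p: tsp_score(X, labels, *p))


# Hypothetical toy data: rows are samples, columns are genes.
# Genes 0 and 1 separate the classes; genes 2 and 3 behave alike in both.
X = [
    [1.0, 5.0, 3.0, 3.1],
    [2.0, 6.0, 3.2, 2.9],
    [1.5, 5.5, 2.8, 3.0],
    [5.0, 1.0, 3.1, 3.0],
    [6.0, 2.0, 2.9, 3.2],
    [5.5, 1.5, 3.0, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]

kept = auc_filter(X, labels, threshold=0.8)   # hypothetical threshold
best = top_scoring_pair(X, labels, kept)
print("genes kept:", kept)
print("top pair:", best, "score:", tsp_score(X, labels, *best))
```

The saving comes from the size of the search space: an exhaustive pair search over G genes evaluates G(G - 1)/2 pairs, so reducing G to a smaller filtered set G' reduces the work to G'(G' - 1)/2 pairs. In the toy data above, filtering leaves two of the four genes, so one pair is scored instead of six.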
