Breast cancer prognosis by combinatorial analysis of gene expression data

Introduction: The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with two objectives: discovering patterns characteristic of cases with good or poor outcome and using them for accurate and justifiable predictions; and deriving novel information about the role of genes, the existence of special classes of cases, and other factors.

Method: Data were analyzed using LAD, a method based on combinatorics and optimization that was recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines.

Results: LAD identified a subset of 17 of the 25,000 genes capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods, and its accuracy is similar to or better than that reported in other studies. Of the 17 genes identified by LAD, three were shown to play a significant role in determining poor prognosis and five in determining good prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and test sets of van 't Veer have differing characteristics.

Conclusion: The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
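To make the notion of a 'pattern' concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a pattern is a conjunction of lower and upper bounds on the expression levels of a few genes, that a case is classified by comparing the fraction of good-prognosis patterns covering it with the fraction of poor-prognosis patterns covering it, and that weighted accuracy is the mean of the per-class accuracies. The gene names, thresholds, and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Pattern:
    """Hypothetical pattern: gene name -> (lower bound, upper bound)."""
    bounds: Dict[str, Tuple[float, float]]

    def covers(self, sample: Dict[str, float]) -> bool:
        # A sample is covered only if every bounded gene falls in its interval.
        return all(lo <= sample[g] <= hi for g, (lo, hi) in self.bounds.items())


def discriminant(sample: Dict[str, float],
                 good: List[Pattern], poor: List[Pattern]) -> float:
    # Fraction of good-prognosis patterns covering the sample
    # minus the fraction of poor-prognosis patterns covering it.
    frac_good = sum(p.covers(sample) for p in good) / len(good)
    frac_poor = sum(p.covers(sample) for p in poor) / len(poor)
    return frac_good - frac_poor


def predict(sample: Dict[str, float],
            good: List[Pattern], poor: List[Pattern]) -> str:
    d = discriminant(sample, good, poor)
    if d > 0:
        return "good"
    if d < 0:
        return "poor"
    return "unclassified"


def weighted_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    # Assumed definition: mean of the accuracy on each class,
    # so that an unbalanced class distribution does not dominate.
    rates = []
    for cls in ("good", "poor"):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        rates.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy patterns on made-up genes; not taken from the study.
    good_patterns = [Pattern({"geneA": (0.3, math.inf)})]
    poor_patterns = [Pattern({"geneB": (-math.inf, -0.5)})]
    case = {"geneA": 0.5, "geneB": -0.2}
    print(predict(case, good_patterns, poor_patterns))
```

Because each prediction is traced back to the specific patterns covering a case, a system of this shape can report which gene-level conditions led to a given prognosis, which is the sense in which the abstract calls the model explanatory.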
