Windowing improvements towards more comprehensible models

We propose several improvements to the windowing algorithm. We evaluated model performance, interpretability, and stability. Our methodology focuses on the interpretability of the model. Our approach shows differences in interpretability without harming performance. Our approach may yield better classification models.

Decision tree induction searches the data for relevant characteristics that allow it to model a given concept precisely, but it is also concerned with the comprehensibility of the generated model, which helps human specialists discover new knowledge, something very important in the medical and biological areas. On the other hand, such inducers present some instability. The main problem addressed here is the behavior of those inducers on high-dimensional data, more specifically gene expression data: irrelevant attributes may harm the learning process, and many models with similar performance may be generated. To address those problems, we explored and revised windowing in four ways: pruning the trees generated during intermediate steps of the algorithm; using the estimated error instead of the training error; using the error weighted according to the size of the current window; and using the classification confidence as the window update criterion. The results show that the proposed algorithm outperforms the classical one, especially on measures of complexity and comprehensibility of the induced models.
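For readers unfamiliar with windowing, the basic loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm evaluated in this work: `train_stump` is a hypothetical one-split stand-in for a real tree inducer, and the parameters `initial_size`, `max_added`, `min_confidence`, and `max_rounds` are names chosen here for illustration. The sketch covers the classical update (add misclassified examples to the window) and a confidence-based update (also add examples classified with low confidence); it does not implement pruning, estimated error, or window-size weighting.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a one-split decision stump on (features, label) rows.

    Hypothetical stand-in for a real tree inducer. Returns a function
    mapping a feature tuple to (predicted_label, confidence), where
    confidence is the majority-class fraction in the chosen leaf.
    """
    best = None  # (training_errors, feature_index, threshold, left, right)
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = Counter(y for x, y in rows if x[f] <= t)
            right = Counter(y for x, y in rows if x[f] > t)
            if not left or not right:
                continue  # split must send examples to both sides
            errors = (sum(left.values()) - max(left.values())
                      + sum(right.values()) - max(right.values()))
            if best is None or errors < best[0]:
                best = (errors, f, t, left, right)

    if best is None:
        # No usable split: fall back to a single majority-class leaf.
        counts = Counter(y for _, y in rows)
        label, hits = counts.most_common(1)[0]
        return lambda x: (label, hits / len(rows))

    _, f, t, left, right = best

    def predict(x):
        leaf = left if x[f] <= t else right
        label, hits = leaf.most_common(1)[0]
        return label, hits / sum(leaf.values())

    return predict


def windowing(data, initial_size, max_added, min_confidence=0.0,
              max_rounds=20, seed=0):
    """Grow a training window until the model handles the remaining data.

    With min_confidence=0.0 only misclassified examples enter the window
    (classical update). A positive min_confidence also admits examples
    that are classified correctly but with low confidence.
    """
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    window, rest = pool[:initial_size], pool[initial_size:]
    model = train_stump(window)

    for _ in range(max_rounds):
        hard = []
        for i, (x, y) in enumerate(rest):
            label, confidence = model(x)
            if label != y or confidence < min_confidence:
                hard.append(i)
        if not hard:
            break  # every remaining example is handled; stop growing
        picked = hard[:max_added]
        window.extend(rest[i] for i in picked)
        picked_set = set(picked)
        rest = [r for i, r in enumerate(rest) if i not in picked_set]
        model = train_stump(window)

    return model, window


# Toy data: one feature, class "a" below 10 and class "b" from 10 upward.
data = [((i,), "a" if i < 10 else "b") for i in range(20)]
model, window = windowing(data, initial_size=4, max_added=3)
```

The appeal of the scheme is that the inducer only ever sees the window, so the final model is shaped by the examples the intermediate models found difficult rather than by the full training set.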
