Classification tree algorithm for grouped variables

We consider the problem of predicting a categorical variable from groups of inputs. Some methods have already been proposed to build classification rules from groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this problem. Here, we propose the Tree Penalized Linear Discriminant Analysis algorithm (TPLDA), a new tree-based approach that constructs a classification rule based on groups of variables. It splits a node by selecting a group and then applying a regularized linear discriminant analysis to this group. This process is repeated until some stopping criterion is satisfied. A pruning strategy is proposed to select an optimal tree. Compared to existing multivariate classification tree methods, the proposed method is computationally less demanding and the resulting trees are more easily interpretable. Furthermore, TPLDA automatically provides a measure of importance for each group of variables. This score makes it possible to rank groups of variables with respect to their ability to predict the response and can also be used to perform group variable selection. The performance of the proposed algorithm and its value in terms of prediction accuracy, interpretation and group variable selection are illustrated and compared with alternative reference methods through simulations and applications on real datasets.
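To make the node-splitting step concrete, the following is a minimal sketch of one group-wise split for a two-class problem. It is an illustration under stated assumptions, not the TPLDA implementation: the abstract does not specify the penalty or the group-selection criterion, so the sketch assumes a ridge-type penalty on the pooled covariance and selects the group with the largest decrease in Gini impurity. The function names (`gini`, `ridge_lda_direction`, `best_group_split`), the parameter `lam` and the toy data are all hypothetical.

```python
import numpy as np


def gini(y):
    """Gini impurity of a binary 0/1 label vector."""
    if y.size == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)


def ridge_lda_direction(X, y, lam):
    """Two-class LDA direction with a ridge-type penalty (an assumption,
    standing in for the unspecified regularization used by TPLDA)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance, normalised by n - 2.
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(y) - 2)
    w = np.linalg.solve(S + lam * np.eye(X.shape[1]), m1 - m0)
    # Midpoint threshold between projected class means (ignores class priors).
    threshold = w @ (m0 + m1) / 2.0
    return w, threshold


def best_group_split(X, y, groups, lam=0.1):
    """Try one penalized-LDA split per group and keep the group whose split
    gives the largest impurity decrease (assumed selection criterion)."""
    parent = gini(y)
    best = None
    for name, cols in groups.items():
        Xg = X[:, cols]
        w, t = ridge_lda_direction(Xg, y, lam)
        right = Xg @ w > t
        n_r = right.sum()
        n_l = len(y) - n_r
        child = (n_l * gini(y[~right]) + n_r * gini(y[right])) / len(y)
        gain = parent - child
        if best is None or gain > best["gain"]:
            best = {"group": name, "w": w, "threshold": t, "gain": gain}
    return best


# Toy data: group "a" carries the class signal, group "b" is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, :2] += 2.0 * y[:, None]
groups = {"a": [0, 1], "b": [2, 3, 4]}

best = best_group_split(X, y, groups)
print(best["group"], round(float(best["gain"]), 3))
```

A full tree would apply `best_group_split` recursively to the two child nodes until a stopping criterion is met and then prune. One possible group importance score, again only as an assumption, would accumulate the impurity decreases attributed to each group over the nodes of the tree.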
