DPPred: An Effective Prediction Framework with Concise Discriminative Patterns

In the literature, two families of models have been proposed to address prediction problems, including classification and regression. Simple models, such as generalized linear models, offer only ordinary performance but strong interpretability on a set of simple features. The other family, which includes tree-based models, organizes numerical, categorical, and high-dimensional features into a comprehensive structure that carries rich interpretable information about the data. In this paper, we propose a novel Discriminative Pattern-based Prediction framework (DPPred) that accomplishes prediction tasks by combining the advantages of both families: effectiveness and interpretability. Specifically, DPPred adopts the concise discriminative patterns that lie on the prefix paths from the root to the leaf nodes of tree-based models. DPPred then selects a limited number of useful discriminative patterns by searching for the most effective pattern combination to fit generalized linear models. Extensive experiments show that in many scenarios DPPred provides accuracy competitive with the state of the art, as well as valuable interpretability for developers and experts. In particular, in a case study on a clinical application dataset, DPPred outperforms the baselines while using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns.
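As an illustration of what a prefix-path pattern is, the following minimal Python sketch enumerates every prefix path of a toy decision tree and maps instances into a binary pattern space. It is not the authors' implementation: the nested-dictionary tree format, the function names, and the feature names `age` and `bmi` are all hypothetical, and the step that selects a small pattern subset by fitting a generalized linear model is omitted.

```python
# Hypothetical toy tree: each internal node tests `feature < threshold`.
# A node is a dict with "feature", "threshold", "left", "right"; a leaf is None.

def prefix_path_patterns(node, prefix=()):
    """Yield every non-empty prefix path from the root as a tuple of conditions."""
    if node is None:
        return
    for branch, op in (("left", "<"), ("right", ">=")):
        pattern = prefix + ((node["feature"], op, node["threshold"]),)
        yield pattern  # the path so far is itself a candidate pattern
        yield from prefix_path_patterns(node[branch], pattern)

def matches(pattern, instance):
    """Return True if the instance satisfies every condition in the pattern."""
    return all(
        instance[feature] < threshold if op == "<" else instance[feature] >= threshold
        for feature, op, threshold in pattern
    )

def to_pattern_space(patterns, instances):
    """Encode each instance as a 0/1 vector with one entry per pattern."""
    return [[1 if matches(p, x) else 0 for p in patterns] for x in instances]

# Assumed example data (feature names and values are made up for illustration).
tree = {
    "feature": "age", "threshold": 50.0,
    "left": {"feature": "bmi", "threshold": 25.0, "left": None, "right": None},
    "right": None,
}
instances = [
    {"age": 40.0, "bmi": 22.0},
    {"age": 40.0, "bmi": 30.0},
    {"age": 65.0, "bmi": 27.0},
]

patterns = list(prefix_path_patterns(tree))
features = to_pattern_space(patterns, instances)
```

Each column of `features` corresponds to one conjunction of conditions, so the resulting binary matrix could serve as the candidate input to a pattern selection step such as the one the abstract describes.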
