A Tree-Based Semi-Varying Coefficient Model for the COM-Poisson Distribution

Abstract: We propose a tree-based semi-varying coefficient model for the Conway–Maxwell–Poisson (CMP or COM-Poisson) distribution, a two-parameter generalization of the Poisson distribution that is flexible enough to capture both under-dispersion and over-dispersion in count data. The advantage of tree-based methods is their scalability to high-dimensional data. We develop CMPMOB, an estimation procedure for a semi-varying coefficient model, using model-based recursive partitioning (MOB). The proposed framework is broader than the existing MOB framework because it allows node-invariant effects to be included in the model. To reduce the computational burden of the exhaustive search employed in the original MOB algorithm, we propose a new split point estimation procedure that borrows tools from change point estimation methodology. The proposed method uses only the estimated score functions, without fitting a model at each candidate split point, and is therefore computationally simpler. Since tree-based methods provide only a piecewise-constant approximation to the underlying smooth function, we further propose the CMPBoost semi-varying coefficient model, which uses a gradient boosting procedure for estimation. The usefulness of the proposed methods is illustrated through simulation studies and a real example from a bike-sharing system in Washington, DC. Supplementary files for this article are available online.
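The central computational idea above can be made concrete. The following is a minimal sketch, in Python, of score-based split point selection for a CMP node model: the model is fit once on the full node, score contributions are evaluated at that fit, and a CUSUM-type scan of their partial sums locates the split, with no refitting at candidate cut points. This is an illustration, not the authors' CMPMOB implementation: the function names (cmp_score_loglam, split_point), the series truncation max_j, the restriction to a single parameter (log lambda) and a univariate split variable, and the Brownian-bridge scaling of the statistic are all assumptions made for the sketch.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def cmp_score_loglam(y, lam, nu, max_j=200):
        """Score contributions d/d(log lam) of the CMP log-likelihood,
        s_i = y_i - E[Y], with E[Y] computed from a truncated series
        for Z(lam, nu) = sum_j lam^j / (j!)^nu."""
        j = np.arange(max_j)
        log_terms = j * np.log(lam) - nu * gammaln(j + 1)  # log of lam^j / (j!)^nu
        log_Z = logsumexp(log_terms)                       # log normalizing constant
        mean = np.sum(j * np.exp(log_terms - log_Z))       # E[Y] under (lam, nu)
        return y - mean

    def split_point(z, scores):
        """Scan cumulative scores along the ordering of the split variable z;
        return the z-value maximizing a Brownian-bridge-scaled CUSUM statistic."""
        order = np.argsort(z)
        s = scores[order]
        n = len(s)
        S = np.cumsum(s - s.mean())                        # centered partial sums
        k = np.arange(1, n)                                # interior candidate cuts
        sigma = s.std(ddof=1)
        stat = (S[:-1] / (sigma * np.sqrt(n))) ** 2 / (k / n * (1 - k / n))
        cut = np.argmax(stat)
        return 0.5 * (z[order][cut] + z[order][cut + 1]), stat[cut]

    # Toy usage: the mean shifts at z = 0, so the scan should cut near 0.
    # (CMP with nu = 1 reduces to the Poisson distribution.)
    rng = np.random.default_rng(1)
    z = rng.uniform(-1, 1, 500)
    y = rng.poisson(np.where(z < 0, 2.0, 6.0))
    s = cmp_score_loglam(y, lam=y.mean(), nu=1.0)
    print(split_point(z, s))

Because the scores are computed once at the full-node fit, the scan costs only a sort plus a linear pass per candidate split variable, in contrast to the exhaustive search, which refits the CMP model at every candidate cut.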
