A methodology to design heuristics for model selection based on the characteristics of data: Application to investigate when the Negative Binomial Lindley (NB-L) is preferred over the Negative Binomial (NB).

Safety analysts usually rely on post-modeling methods, such as goodness-of-fit statistics or the likelihood ratio test, to choose between two or more competing distributions or models. Such metrics require every candidate distribution to be fitted to the data before any comparison can be made. Given the steady stream of newly introduced statistical distributions, choosing the best one with such post-modeling methods is not a trivial task, quite apart from the theoretical and numerical issues the analyst may face during the analysis. Furthermore, and most importantly, these measures and tests offer no intuition into why a specific distribution (or model) is preferred over another (Goodness-of-Logic). This paper addresses these issues by proposing a methodology to design heuristics for model selection based on the characteristics of the data, expressed as descriptive summary statistics, before the models are fitted. The proposed methodology employs two analytic tools, (1) Monte-Carlo simulations and (2) machine learning classifiers, to design simple heuristics that predict the label of the 'most-likely-true' distribution for the data at hand. The methodology was applied to investigate when the recently introduced Negative Binomial Lindley (NB-L) distribution is preferred over the Negative Binomial (NB) distribution. Heuristics were designed to select the 'most-likely-true' distribution between these two, given a set of prescribed summary statistics of the data. The proposed heuristics were compared against classical tests on several observed datasets, with successful results. Not only are they easy to use and free of post-modeling inputs, but they also give the analyst useful information about why the NB-L is preferred over the NB, or vice versa, when modeling data.
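The two-step idea described above (simulate labeled datasets, then learn a rule from their summary statistics) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the NB-L is drawn as an NB whose success probability is exp(-lambda) with lambda following a Lindley(theta) distribution; the shape parameter r is restricted to integers; the parameter ranges, the three summary statistics, and the sample sizes are hypothetical choices; and a one-split decision stump stands in for whichever classifier the paper actually used.

```python
import math
import random
import statistics


def draw_nb(rng, r, p):
    """One NB draw (failures before the r-th success, integer r) as a sum of geometrics."""
    if p >= 1.0:
        return 0
    log_q = math.log1p(-p)  # log(1 - p), numerically stable for small p
    return sum(int(math.log(1.0 - rng.random()) / log_q) for _ in range(r))


def draw_lindley(rng, theta):
    """Lindley(theta) as a mixture of Exp(theta) and Gamma(shape=2, rate=theta)."""
    if rng.random() < theta / (1.0 + theta):
        return rng.expovariate(theta)
    return rng.gammavariate(2.0, 1.0 / theta)  # second argument is the scale


def draw_nbl(rng, r, theta):
    """One NB-L draw: NB with success probability exp(-lambda), lambda ~ Lindley(theta)."""
    return draw_nb(rng, r, math.exp(-draw_lindley(rng, theta)))


def summarize(counts):
    """Features computed before any model is fitted (an illustrative choice)."""
    mean = statistics.mean(counts)
    vmr = statistics.variance(counts) / mean if mean > 0 else 0.0
    zero_share = counts.count(0) / len(counts)
    return (mean, vmr, zero_share)


def simulate(rng, n_datasets=300, n_obs=200):
    """Monte-Carlo step: labeled datasets (0 = NB, 1 = NB-L) reduced to summary statistics."""
    rows, labels = [], []
    for _ in range(n_datasets):
        label = rng.randint(0, 1)
        r = rng.randint(1, 3)  # hypothetical parameter range
        if label == 0:
            p = rng.uniform(0.2, 0.8)  # hypothetical parameter range
            counts = [draw_nb(rng, r, p) for _ in range(n_obs)]
        else:
            theta = rng.uniform(2.5, 5.0)  # hypothetical; theta > 2 keeps the variance finite
            counts = [draw_nbl(rng, r, theta) for _ in range(n_obs)]
        rows.append(summarize(counts))
        labels.append(label)
    return rows, labels


def fit_stump(rows, labels):
    """Classifier step: exhaustive search for the single split with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            for above in (0, 1):
                errors = sum((above if row[j] > t else 1 - above) != y
                             for row, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    return best


rng = random.Random(2024)
rows, labels = simulate(rng)
errors, feature, threshold, label_above = fit_stump(rows, labels)
accuracy = 1.0 - errors / len(rows)
names = ("mean", "variance-to-mean ratio", "share of zeros")
print(f"if {names[feature]} > {threshold:.3f}: predict {'NB-L' if label_above else 'NB'} "
      f"(training accuracy {accuracy:.2f})")
```

The printed rule has the form of the heuristics the abstract describes: a threshold on a descriptive statistic that can be checked before either distribution is fitted. The rule and its accuracy depend entirely on the assumed parameter ranges, and the accuracy shown is measured on the training simulations only.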
