Contrasting case-wise deletion with multiple imputation and latent variable approaches to dealing with missing observations in count regression models

Missing data can lead to biased and inefficient parameter estimates in statistical models, depending on the missing data mechanism. Count regression models are no exception: missing data can lead to incorrect inferences about the effects of explanatory variables. A convenient approach for dealing with missing data is to remove observations with incomplete records prior to the analysis, often referred to as case-wise deletion. Removing incomplete records, however, reduces the sample size, increases standard errors and, if data are not missing completely at random, can produce biased parameter estimates. A more complex approach is multiple imputation, which provides an estimate of the modelling uncertainty created by the data ‘missing-ness’, as distinct from the natural variation in the data. However, multiple imputation produces biased parameter estimates if the probability of missing data is related to the missing values themselves, that is, if the missing-ness is endogenous. Latent variable modelling has recently been introduced as an alternative approach for dealing with missing data, but it comes at a high computational cost and added complexity. Despite fairly extensive methodological advancements in the statistical literature, case-wise deletion is commonly employed to deal with missing data in statistical models of transport, while the multiple imputation and latent variable approaches remain relatively unexplored. More importantly, the performance of these approaches has not been tested across different types of data missing-ness. To address these gaps, this study contrasts case-wise deletion with multiple imputation and latent variable approaches to dealing with missing data in count regression models. We compare the performance of these three approaches using crash count models estimated against empirical data obtained from state-controlled roads in Queensland, Australia.
A quasi-experimental evaluation of data missing-ness is then conducted by extracting three data subsets from the original dataset, each with a distinct missing data mechanism (with terminology adopted from the statistical literature): missing completely at random, missing at random, and missing not at random. The three approaches are then applied to each data subset, and the results are compared in terms of bias, precision of parameter estimates, and goodness-of-fit. The findings indicate that multiple imputation is the most effective approach when data are missing either completely at random or at random, whereas the latent variable approach is more effective when data are missing not at random. However, the effectiveness of the latent variable approach depends on the availability of suitable variables in the data to serve as instruments.
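
The three missing data mechanisms named above can be illustrated with a minimal sketch on synthetic data. This is not the study's procedure or its Queensland dataset: the variable names (x, z, y), the 30% baseline missing rate, the logistic form of the missing-ness probability, and the regression coefficients are all assumptions chosen for illustration. Here x stands for a covariate with missing values, z for a fully observed covariate, and y for a crash count.

```python
import numpy as np


def make_missing_masks(x, z, rate=0.3, seed=0):
    """Return boolean masks (True = x is missing) for three mechanisms.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]

    def logistic_mask(driver):
        # Standardise the driver; the intercept is set so that a record
        # at the driver's mean has missing probability equal to `rate`.
        s = (driver - driver.mean()) / driver.std()
        logit = np.log(rate / (1.0 - rate)) + 1.5 * s
        p = 1.0 / (1.0 + np.exp(-logit))
        return rng.random(n) < p

    return {
        # MCAR: the same missing probability for every record
        "MCAR": rng.random(n) < rate,
        # MAR: missing probability depends only on the observed z
        "MAR": logistic_mask(z),
        # MNAR: missing probability depends on the value of x itself
        "MNAR": logistic_mask(x),
    }


# Synthetic data (assumed, not empirical)
rng = np.random.default_rng(42)
n = 20000
z = rng.normal(size=n)                            # fully observed covariate
x = 0.6 * z + 0.8 * rng.normal(size=n)            # covariate subject to missing-ness
y = rng.poisson(np.exp(0.2 + 0.5 * x + 0.3 * z))  # simulated crash counts

masks = make_missing_masks(x, z)
for name, missing in masks.items():
    keep = ~missing  # case-wise deletion retains only complete records
    print(name, int(keep.sum()), round(float(x[keep].mean()), 3),
          round(float(y[keep].mean()), 3))
```

Comparing the retained sample size and the complete-case means across the three masks gives a rough sense of how case-wise deletion distorts the retained sample when missing-ness is not completely at random. A fuller comparison would fit a count model to each subset, which this sketch leaves out.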
