Regression imputation optimizing sample size and emulation: Demonstrations and comparisons to prominent methods

Abstract: Missing input values weaken the ability of information systems (IS) researchers to make calculations, reducing effective sample sizes and statistical power. Such technical problems with data cascade into scientific limitations, and social and economic issues go unstudied as a result. Extensive missing values therefore force researchers to make crucial decisions, such as whether to impute and, if so, which strategy to use. This study presents a single imputation approach that integrates and extends best practices for mitigating the effects of missing values. Using an array of missing value situations, we illustrate the Regression Imputation Optimizing Sample Size and Emulation (RIOSSE) method. The approach derives an imputation model for each low-sample variable, leveraging information available in large-sample inputs within the same data source. RIOSSE derives imputation equations with two competing goals in mind: 1) statistical power and 2) emulation. Direct comparisons demonstrate that RIOSSE is superior to three prominent multiple imputation methods (K-Nearest Neighbor, missForest, and LASSO) on two criteria for each goal: parsimony and sample size for statistical power, and predictiveness and content validity for emulation. Further, 5-fold cross-validation confirmed the head-to-head comparisons on these criteria. The paper contributes 1) a description of the RIOSSE method, 2) new imputation performance metrics and visualizations, 3) comparisons of our proposed method to three prominent multiple imputation methods, and 4) specified imputation models for 30 commonly used inputs to firm performance calculations.
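The core idea of deriving an imputation model for a low-sample variable from large-sample inputs can be illustrated with a minimal sketch of plain single regression imputation. This is not the RIOSSE procedure itself: the abstract does not specify how predictors are selected or how the power and emulation goals are traded off, so those steps are omitted. The function name `regression_impute`, the variable names `assets` and `sales`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def regression_impute(y, X):
    """Fill NaNs in y with OLS predictions from fully observed predictors X.

    Hypothetical helper: a generic single regression imputation,
    not the authors' RIOSSE implementation.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(y)
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])
    # Fit the imputation equation on rows where y is observed.
    coef, *_ = np.linalg.lstsq(A[observed], y[observed], rcond=None)
    filled = y.copy()
    # Predict only the missing rows; observed values are left untouched.
    filled[~observed] = A[~observed] @ coef
    return filled, coef

# Toy data, fabricated purely for illustration (assumption).
rng = np.random.default_rng(0)
assets = rng.uniform(10, 100, size=50)   # large-sample input, fully observed
sales = 2.0 + 0.5 * assets               # exact linear relation for clarity
sales_missing = sales.copy()
sales_missing[::5] = np.nan              # low-sample variable: every 5th value missing

filled, coef = regression_impute(sales_missing, assets.reshape(-1, 1))
```

Here `filled` has no missing entries, so every row can enter a downstream calculation, which is the sample-size benefit the abstract refers to. With real data the fit is not exact, and choosing how many predictors to include is where parsimony and predictiveness would have to be weighed against each other.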
