LASSO+DEA for small and big wide data

Abstract In data envelopment analysis (DEA), the curse of dimensionality may jeopardize the accuracy, or even the relevance, of results when the dimension of inputs and outputs is relatively large, even for relatively large samples. Recently, a machine learning approach based on the least absolute shrinkage and selection operator (LASSO) for variable selection was combined with sign-constrained convex nonparametric least squares (SCNLS, a special case of DEA) and dubbed LASSO-SCNLS, as a way to circumvent the curse of dimensionality. In this paper, we revisit this interesting approach by considering various data generating processes. We also explore a more advanced version of LASSO, the so-called elastic net (EN) approach, adapt it to DEA and propose EN-DEA. Our Monte Carlo simulations provide additional and, to some extent, new evidence and conclusions. In particular, we find that none of the considered approaches clearly dominates the others. To circumvent the curse of dimensionality of DEA in the context of big wide data, we also propose a simplified two-step approach, which we call LASSO+DEA. We find that this simplified approach can be more useful than the existing, more sophisticated approaches for reducing very large dimensions into sparser, more parsimonious DEA models that attain greater discriminatory power and suffer less from the curse of dimensionality.
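The LASSO+DEA idea is a two-step workflow: first screen the candidate variables with a LASSO regression so that irrelevant ones are shrunk to zero, then run a standard DEA model on the retained subset only. The sketch below illustrates this two-step idea under simplifying assumptions; the use of scikit-learn's LassoCV for the screening step, SciPy's linprog for an input-oriented CCR envelopment model, the single-output screening regression, and all function names are illustrative choices for the example, not the authors' exact specification.

```python
# Minimal sketch of a two-step "LASSO then DEA" workflow (an illustration only).
# Step 1: cross-validated LASSO screens the inputs; Step 2: input-oriented CCR
# DEA is solved on the retained inputs via the envelopment LP.
import numpy as np
from sklearn.linear_model import LassoCV
from scipy.optimize import linprog

def select_inputs_lasso(X, y):
    """Return indices of inputs whose LASSO coefficients are nonzero."""
    lasso = LassoCV(cv=5).fit(X, y)
    return np.flatnonzero(lasso.coef_)

def dea_ccr_input(X, Y):
    """Input-oriented CCR efficiency scores, one LP per DMU."""
    n, m = X.shape            # n DMUs, m inputs
    s = Y.shape[1]            # s outputs
    scores = np.empty(n)
    for o in range(n):
        # decision variables: [theta, lambda_1, ..., lambda_n]
        c = np.r_[1.0, np.zeros(n)]
        # inputs:  sum_j lambda_j * x_ij - theta * x_io <= 0
        A_in = np.hstack([-X[o].reshape(-1, 1), X.T])
        # outputs: -sum_j lambda_j * y_rj <= -y_ro
        A_out = np.hstack([np.zeros((s, 1)), -Y.T])
        A_ub = np.vstack([A_in, A_out])
        b_ub = np.r_[np.zeros(m), -Y[o]]
        bounds = [(None, None)] + [(0, None)] * n
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        scores[o] = res.x[0]
    return scores

# Usage (shapes are assumptions): X is (n_DMUs, n_inputs), y is a single output
# used for the LASSO screen, Y is (n_DMUs, n_outputs) used in the DEA step.
# kept = select_inputs_lasso(X, y)
# eff = dea_ccr_input(X[:, kept], Y)
```

The point of the sketch is only the division of labour: dimension reduction happens entirely in the regression step, so the subsequent DEA model is run on a sparser set of variables and retains more discriminatory power.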
