LASSO variable selection in data envelopment analysis with small datasets

Abstract The curse of dimensionality arises when a limited number of observations is used to estimate a high-dimensional frontier, in particular by data envelopment analysis (DEA). Using simulations based on a data generating process (DGP), this study argues that the typical “rule of thumb” used in DEA, e.g., that the number of observations should be at least twice the number of inputs and outputs, is ambiguous and leads to large deviations in the estimated technical efficiency. To address this issue, we propose a Least Absolute Shrinkage and Selection Operator (LASSO) variable selection technique, commonly used in data science to extract significant factors, and combine it with sign-constrained convex nonparametric least squares (SCNLS), which can be regarded as a DEA estimator. Simulation results demonstrate that the proposed LASSO-SCNLS method and its variants provide useful guidelines for applying DEA with small datasets.
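
For orientation, the two ingredients named in the abstract can be written out in generic form. This is a sketch in standard textbook notation, not the paper's own LASSO-SCNLS formulation, which the abstract does not give. The symbols n (observations), m (inputs), s (outputs), x_{ij}, y_i, \beta and \lambda are assumptions introduced here for illustration.

```latex
% Rule of thumb quoted in the abstract (notation assumed):
% n observations, m inputs, s outputs.
n \ge 2\,(m + s)

% Generic LASSO estimator for a linear model (standard form, not the
% paper's LASSO-SCNLS variant): squared-error loss plus an L1 penalty
% that shrinks some coefficients exactly to zero.
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \left\{ \frac{1}{2} \sum_{i=1}^{n}
    \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad \lambda \ge 0
```

A larger \lambda drives more coefficients to zero, which is what makes the penalty usable for variable selection. How the penalty is attached to the SCNLS constraints is specific to the paper and is not reconstructed here.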
