Overcoming missing data bias in water utility indicators by using nested balanced panels

Abstract This paper demonstrates a methodology for calculating trends in unbalanced panel datasets drawn from nonrandom samples, using the International Benchmarking Network for Water and Sanitation Utilities (IBNET) dataset, which covers more than 5000 utilities. The methodology can be used for any dataset: it calculates the change, or delta, for the same unit of observation (in this case, a utility) between two consecutive years, and nests these deltas to calculate an average trend for a given variable over the longest time horizon possible. We use this method to show trends in water utilities’ performance between 2004 and 2015 at a global level and to reveal differences in performance between groups of utilities. To provide context for the dataset, the representativeness of IBNET is also discussed. A probit analysis of how representative the utilities in the IBNET dataset are over time reveals that the utilities that reported their data in earlier years generally have more connections and perform better than the utilities that reported their data in later years. In other words, the bigger and better performing utilities are the first to report; as the number of reporting utilities increases over the years, smaller and less well performing utilities also start reporting their data.
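The nested-delta calculation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `nested_panel_trend`, the `(utility_id, year, value)` record layout, and the choice to report the trend as a cumulative change relative to the first year are all assumptions made for illustration. For each pair of consecutive years, only utilities that report in both years (a small balanced panel) contribute to the average delta, and the average deltas are then chained across years.

```python
from collections import defaultdict


def nested_panel_trend(records):
    """Chain year-over-year average deltas from an unbalanced panel.

    records: iterable of (utility_id, year, value) tuples (hypothetical layout).
    Returns {year: cumulative average change relative to the first year}.
    """
    # Index the observed (non-missing) values by year, then by utility.
    by_year = defaultdict(dict)
    for utility_id, year, value in records:
        if value is not None:
            by_year[year][utility_id] = value

    years = sorted(by_year)
    if not years:
        return {}

    cumulative = 0.0
    trend = {years[0]: cumulative}
    for prev, curr in zip(years, years[1:]):
        if curr != prev + 1:
            break  # assumption: stop the chain at a gap in the year sequence
        # Balanced panel for this pair: utilities observed in both years.
        common = by_year[prev].keys() & by_year[curr].keys()
        if not common:
            break  # no overlap, so no delta can be computed for this pair
        mean_delta = sum(by_year[curr][u] - by_year[prev][u] for u in common) / len(common)
        cumulative += mean_delta
        trend[curr] = cumulative
    return trend


# Toy data (invented for illustration, not IBNET values).
records = [
    ("A", 2004, 10.0), ("A", 2005, 12.0), ("A", 2006, 13.0),
    ("B", 2005, 50.0), ("B", 2006, 54.0),
    ("C", 2004, 20.0), ("C", 2005, 20.0),
]
print(nested_panel_trend(records))  # {2004: 0.0, 2005: 1.0, 2006: 3.5}
```

In this toy data, the naive yearly averages over whichever utilities happen to report would be 15.0, about 27.3, and 33.5, a jump driven mostly by utility B entering the sample in 2005. The chained deltas (+1.0, then +2.5) reflect only changes within utilities observed in both years of each pair.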
