Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond

Most work on skewed, stochastic, high-dimensional, and biased datasets implicitly addresses each of these problems separately. Recently, the Texas Commission on Environmental Quality (TCEQ) approached us to help build highly accurate ozone level alarm forecasting models for the Houston area, where these technical difficulties come together in a single problem. The characteristics that make this problem challenging and interesting are: (1) the dataset is sparse (72 features, and 2% or 5% positives depending on the criterion for “ozone days”), (2) it evolves from year to year, (3) it is limited in size (7 years, or around 2,500 data entries), (4) it contains a large number of irrelevant features, (5) it is biased in the sense of “sample selection bias”, and (6) the true model is stochastic as a function of the measurable factors. Besides solving a difficult application problem, this dataset offers a unique opportunity to explore new and existing data mining techniques, and to provide experience, guidance, and solutions for similar problems. Our main technical focus is on how to estimate reliable probabilities given both sample selection bias and a large number of irrelevant features, and how to choose the most reliable decision threshold for predicting an unknown future with a different distribution. On the application side, our chosen approach (bagging probabilistic decision trees and random decision trees) achieves 20% higher recall (correctly detecting 1–3 more ozone days, depending on the year) and 10% higher precision (15–30 fewer false alarm days per year) than the state-of-the-art methods used by air quality control scientists; these results are significant for TCEQ. On the technical side of data mining, extensive empirical results demonstrate that, at least for this problem and probably for other problems with similar characteristics, these two straightforward non-parametric methods can provide significantly more accurate and reliable solutions than a number of sophisticated and well-known algorithms, such as SVM and AdaBoost, among many others.
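As a rough illustration of the two technical questions named above (averaging probability estimates over an ensemble, then picking a decision threshold), the following minimal sketch may help. It is not the authors' procedure: the function names, the minimum-recall selection rule, and the toy scores are all assumptions made for illustration, and each ensemble member is modeled simply as a callable that returns a positive-class probability.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for one ensemble member (e.g. a single tree):
# a callable mapping a feature vector to P(ozone day | x).
Model = Callable[[Sequence[float]], float]


def average_probability(models: List[Model], x: Sequence[float]) -> float:
    """Average the positive-class probability over all ensemble members."""
    return sum(m(x) for m in models) / len(models)


def precision_recall(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_threshold(
    probs: Sequence[float], labels: Sequence[int], min_recall: float
) -> Optional[Tuple[float, float, float]]:
    """Assumed selection rule: among thresholds that reach min_recall on
    held-out data, return (threshold, precision, recall) with the highest
    precision, or None if no threshold qualifies."""
    best: Optional[Tuple[float, float, float]] = None
    for t in sorted(set(probs)):
        p, r = precision_recall(probs, labels, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best


# Toy held-out scores and labels (made up, not from the ozone dataset).
toy_probs = [0.9, 0.8, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0]
chosen = choose_threshold(toy_probs, toy_labels, min_recall=1.0)
```

With rare positives, a default threshold of 0.5 can miss most ozone days, which is why the threshold is treated here as something to select on held-out data rather than fix in advance.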
