First Principle Models Based Dataset Generation for Multi-Target Regression and Multi-Label Classification Evaluation

Machine Learning and Data Mining research depends strongly on the quality and quantity of the real-world datasets available for evaluating the methods under development. In the context of the emerging Online Multi-Target Regression and Multi-Label Classification methodologies, datasets present new characteristics that require specific testing and pose new challenges. The first difficulty in evaluation is the small number of examples available, caused by data damage, privacy preservation, or high acquisition cost. Second, events of interest such as data changes are hard to find in the datasets of specific domains, since these events are naturally scarce. For these reasons, this work proposes a method for producing synthetic datasets with desired properties (number of examples, data change events, ...) for the evaluation of Multi-Target Regression and Multi-Label Classification methods. These datasets are produced using First Principle Models, which give them more realistic and representative properties, such as real-world meaning (physical, financial, ...) for the input and output variables. This type of dataset generation can be used to produce infinite streams and to evaluate incremental methods such as online anomaly and change detection. The paper illustrates the use of synthetic data generation through two showcases of data change evaluation.
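
To make the idea concrete, the following is a minimal sketch of a first-principles stream generator with an injected data change. It is not the model used in this work: it assumes a toy electrical circuit governed by Ohm's law, and the function name `circuit_stream`, the parameter `change_at`, the value ranges, and the 50-ohm fault are all hypothetical choices made for illustration. The two targets (current and power) keep a physical meaning, and the stream is unbounded until the consumer stops reading.

```python
import itertools
import random


def circuit_stream(change_at=500, seed=0):
    """Yield (inputs, targets) pairs indefinitely from Ohm's law.

    Inputs:  (voltage in volts, nominal resistance in ohms)
    Targets: (current in amperes, power in watts)
    """
    rng = random.Random(seed)
    for t in itertools.count():
        voltage = rng.uniform(1.0, 12.0)
        resistance = rng.uniform(10.0, 100.0)
        # Hypothetical change event: from step `change_at` onward, an
        # unobserved 50-ohm series fault alters the input/target relation.
        effective_r = resistance + (50.0 if t >= change_at else 0.0)
        current = voltage / effective_r  # I = V / R
        power = voltage * current        # P = V * I
        yield (voltage, resistance), (current, power)


# Take a finite prefix of the infinite stream for evaluation.
sample = list(itertools.islice(circuit_stream(change_at=500), 1000))
```

Because the change point is known by construction, a change detector run over `sample` can be scored against step 500 directly, which is the kind of evaluation that is hard to carry out on real data where such events are scarce and unlabeled.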
