Instance Selection and Outlier Generation to Improve the Cascade Classifier Precision

Classification of high-dimensional time series with imbalanced classes is a challenging task. The cascade classifier has been proposed for such tasks. It tackles high dimensionality and imbalance by splitting the classification task into several low-dimensional classification tasks and aggregating the intermediate results. To this end, the high-dimensional data set is projected onto low-dimensional subsets. These subsets, however, can exhibit unfavorable and unrepresentative data distributions that in turn hamper classification. Data preprocessing can overcome these problems: small improvements in the low-dimensional data subsets of the cascade classifier lead to an improvement of the aggregated overall results. We present two data preprocessing methods, instance selection and outlier generation. Both methods are based on point distances in low-dimensional space. The instance selection method selects representative feasible examples, and the outlier generation method generates artificial infeasible examples near the class boundary. In an experimental study, we analyse the precision improvement of the cascade classifier due to the presented preprocessing methods on power production time series of a micro combined heat and power (\(\mu\)CHP) plant and on a complex artificial data set. The precision increase is due to an increased selectivity of the learned decision boundaries. This paper is an extended version of [19], in which we proposed the two data preprocessing methods. Here we extend the analysis of both algorithms with a sensitivity analysis of the distance parameters of the preprocessing methods. The two distance parameters depend on each other and have to be chosen carefully. We study their influence on the classification precision of the cascade model and derive parameter fitting rules for the \(\mu\)CHP data set. The experiments yield a region of optimal parameter value combinations that lead to a high classification precision.
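The two distance-based preprocessing ideas can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give. It assumes a greedy thinning rule for instance selection and a random-direction shift for outlier generation. The function names and the parameters `min_dist`, `shift`, and `min_gap` are hypothetical and chosen for illustration only.

```python
import math
import random


def select_instances(points, min_dist):
    """Greedy thinning (illustrative assumption): keep a point only if it is
    at least min_dist away from every point kept so far."""
    kept = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept


def generate_outliers(feasible, shift, min_gap, rng):
    """Shift each feasible point by `shift` in a random direction and keep the
    candidate only if it is at least min_gap away from every feasible point."""
    outliers = []
    dim = len(feasible[0])
    for p in feasible:
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            continue  # degenerate direction, skip this point
        candidate = tuple(x + shift * d / norm for x, d in zip(p, direction))
        if all(math.dist(candidate, q) >= min_gap for q in feasible):
            outliers.append(candidate)
    return outliers


# Toy 2-D data standing in for one low-dimensional subset (made-up values).
rng = random.Random(0)
feasible = [(rng.random(), rng.random()) for _ in range(200)]
selected = select_instances(feasible, min_dist=0.1)
outliers = generate_outliers(selected, shift=0.15, min_gap=0.1, rng=rng)
```

In this sketch the parameters interact: a candidate lies exactly `shift` away from its source point, so `shift` must be at least `min_gap` for any candidate to be accepted, and a larger `min_dist` thins the feasible set and leaves more room for accepted outliers. This only illustrates why such distance parameters cannot be tuned independently; it says nothing about the fitting rules derived in the paper.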

[1] Oliver Kramer, et al. Improving Cascade Classifier Precision by Instance Selection and Outlier Generation, 2016, ICAART.

[2] Huan Liu, et al. Instance Selection and Construction for Data Mining, 2001.

[3] L. Zhuang, et al. Parameter optimization of Kernel-based one-class classifier on imbalance text learning, 2006.

[4] Jack P. C. Kleijnen. Design and Analysis of Simulation Experiments, 2007.

[5] Robert P. W. Duin, et al. Uniform Object Generation for Optimizing One-class Classifiers, 2002, J. Mach. Learn. Res.

[6] Marek Grochowski, et al. Comparison of Instances Selection Algorithms I. Algorithms Survey, 2004, ICAISC.

[7] Oliver Kramer, et al. Classification Cascades of Overlapping Feature Ensembles for Energy Time Series Data, 2015, DARE.

[8] Radhika Dhingra, et al. Sensitivity analysis of infectious disease models: methods, advances and their application, 2013, Journal of The Royal Society Interface.

[9] Paulo Cortez, et al. Opening black box Data Mining models using Sensitivity Analysis, 2011, 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[10] D. Hamby. A review of techniques for parameter sensitivity analysis of environmental models, 1994, Environmental monitoring and assessment.

[11] Tony R. Martinez, et al. Reduction Techniques for Instance-Based Learning Algorithms, 2000, Machine Learning.

[12] Henrik Brohus, et al. Application of sensitivity analysis in design of sustainable buildings, 2009.

[13] James J. Chen, et al. Class-imbalanced classifiers for high-dimensional data, 2013, Briefings Bioinform.

[14] Marcin Blachnik, et al. Ensembles of Instance Selection Methods based on Feature Subset, 2014, KES.

[15] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[16] Jason Lines, et al. Transformation Based Ensembles for Time Series Classification, 2012, SDM.

[17] Yun Shang, et al. A Note on the Extended Rosenbrock Function, 2006.

[18] Francisco Herrera, et al. Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19] Kristóf Marussy, et al. Hubness-Aware Classification, Instance Selection and Feature Construction: Survey and Extensions to Time-Series, 2015, Feature Selection for Data and Pattern Recognition.

[20] Michael Sonnenschein, et al. Support vector based encoding of distributed energy resources' feasible load spaces, 2010, 2010 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT Europe).

[21] G. Hays, et al. Identification of genetically and oceanographically distinct blooms of jellyfish, 2013, Journal of The Royal Society Interface.

[22] Haibo He, et al. Learning from Imbalanced Data, 2009, IEEE Transactions on Knowledge and Data Engineering.

[23] Nathalie Japkowicz, et al. One-Class versus Binary Classification: Which and When?, 2012, 2012 11th International Conference on Machine Learning and Applications.

[24] Paulo Cortez, et al. Using sensitivity analysis and visualization techniques to open black box data mining models, 2013, Inf. Sci.

[25] William Eberle, et al. Genetic algorithms in feature and instance selection, 2013, Knowl. Based Syst.

[26] András Kocsor, et al. Counter-Example Generation-Based One-Class Classification, 2007, ECML.

[27] Haibo He, et al. Assessment Metrics for Imbalanced Learning, 2013.

[28] Emanuele Borgonovo, et al. Sensitivity analysis: A review of recent advances, 2016, Eur. J. Oper. Res.