Improved randomized learning algorithms for imbalanced and noisy educational data classification

Although neural networks have demonstrated strong potential for constructing learners with good predictive performance, several uncertainty issues can greatly affect the effectiveness of supervised learning algorithms, such as class imbalance and labeling errors (class noise). Technically, imbalanced data make it more difficult for learning algorithms to distinguish between classes, while labeling errors can lead to an ill-posed problem formulation built on incorrect hypotheses. Indeed, class noise and class imbalance are pervasive problems in educational data analytics. This study develops improved randomized learning algorithms by investigating a novel type of cost function that addresses the combined effects of class imbalance and class noise. Instead of treating these uncertainty issues in isolation, we present a convex combination of robust and imbalance-aware modelling objectives, leading to a generalized formulation of weighted least squares problems from which the improved randomized learner models can be built. Our experimental study on several educational data classification tasks verifies the advantages of the proposed algorithms over existing methods that either take no account of class imbalance and labeling errors, or consider only one of these aspects.
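To make the formulation above concrete, here is a minimal sketch, not the paper's actual algorithm, of how a combined weighted least-squares readout might look for an RVFL-style randomized learner. The specific weighting choices are assumptions for illustration only: inverse-class-frequency weights stand in for the imbalance objective, Huber-style residual weights stand in for the robust objective, and `lam` is the convex-combination coefficient; the function `combined_weighted_ls` and all parameter names are hypothetical.

```python
import numpy as np

def combined_weighted_ls(H, y, lam=0.5, reg=1e-3):
    """Illustrative weighted least-squares readout for a randomized learner.

    H   : hidden-layer output matrix (n samples x d random features)
    y   : vector of +/-1 class labels
    lam : convex-combination coefficient between robust and imbalance weights
    reg : ridge regularization strength
    """
    n = H.shape[0]

    # Imbalance term (assumed here): inverse-class-frequency weights.
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts))
    w_imb = np.array([n / (len(classes) * freq[c]) for c in y])

    # Robust term (assumed here): Huber-style weights from an unweighted
    # first pass, down-weighting samples with large residuals as a stand-in
    # for the paper's noise-handling objective.
    beta0 = np.linalg.lstsq(H, y, rcond=None)[0]
    r = np.abs(y - H @ beta0)
    delta = np.median(r) + 1e-12
    w_rob = np.minimum(1.0, delta / np.maximum(r, 1e-12))

    # Convex combination of the two weighting objectives.
    w = lam * w_rob + (1 - lam) * w_imb

    # Solve the weighted, ridge-regularized problem:
    # beta = (H' W H + reg I)^{-1} H' W y
    HW = H * w[:, None]
    A = H.T @ HW + reg * np.eye(H.shape[1])
    return np.linalg.solve(A, HW.T @ y)

# Usage sketch on toy imbalanced data with fixed random hidden weights:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 1.0, 1.0, -1.0)
W = rng.normal(size=(5, 50))      # random input weights, never trained
H = np.tanh(X @ W)                # random-feature matrix
beta = combined_weighted_ls(H, y, lam=0.5)
pred = np.sign(H @ beta)
```

In the paper's setting, the robust weights would instead derive from its proposed cost function; the sketch only illustrates the general structure in which only the output weights `beta` are solved for, with the hidden mapping fixed at random.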
