Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency

Machine learning solutions usually assume that training and test data follow the same probability distribution, that is, that the data are stationary. In streaming scenarios, however, the data distribution generally changes over time, that is, the data are non-stationary. The main challenge in such online environments is adapting the model to the constant drifts in the data distribution. A further important restriction may arise in online scenarios: extreme latency in verifying the labels. It is worth noting that incremental drift assumes that class distributions overlap at subsequent time steps. Hence, the core region of the data distribution overlaps significantly with the incoming data, and selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is called core support extraction (CSE). We present a study of density-based algorithms applied to non-stationary environments. We compared KDE, GMM, and two variations of DBSCAN against standalone semi-supervised approaches. We validated these approaches on seventeen synthetic datasets and one real dataset, showing the strengths and weaknesses of these CSE methods across several metrics. We show that a semi-supervised classifier improves by up to 68% on a real dataset when it is combined with a density-based CSE algorithm. As CSE methods, KDE and GMM produced similar results, but the KDE-based approach is more practical because it has fewer parameters.
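The idea of density-based core support extraction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian kernel with a fixed, hand-picked bandwidth and keeps a fixed fraction of the densest samples. The names `gaussian_kde_scores`, `extract_core_supports`, `bandwidth`, and `keep_fraction` are hypothetical and chosen only for this example.

```python
import math


def gaussian_kde_scores(points, bandwidth=1.0):
    """Return an unnormalized Gaussian kernel density estimate at each point.

    The normalizing constant is omitted because only the ranking of the
    scores matters for selecting core supports (an assumption of this sketch).
    """
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        scores.append(total / len(points))
    return scores


def extract_core_supports(points, keep_fraction=0.5, bandwidth=1.0):
    """Return the sorted indices of the densest `keep_fraction` of the points."""
    scores = gaussian_kde_scores(points, bandwidth)
    n_keep = max(1, math.ceil(keep_fraction * len(points)))
    # Rank indices from highest to lowest estimated density.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four samples in a tight cluster plus two isolated samples.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (-4.0, 6.0)]
core = extract_core_supports(samples, keep_fraction=0.5, bandwidth=1.0)
```

In this toy example the retained indices all come from the tight cluster, since the isolated samples receive the lowest density scores. In a streaming setting, the retained samples would be carried over, with their predicted labels, to the next time step.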
