Avoiding Anomalies in Data Stream Learning

The presence of anomalies in data compromises data quality and can reduce the effectiveness of learning algorithms. Standard data mining methodologies treat data cleaning as a pre-processing step that precedes the learning task. Data cleaning becomes harder when learning in the computational model of data streams. In this paper we present a streaming algorithm for learning classification rules that is able to detect contextual anomalies in the data. Contextual anomalies are surprising attribute values in the context defined by the conditional part of a rule. For each example, we compute a degree of anomaliness based on the probability of its attribute values given the conditional part of the rule covering the example. Examples with a high degree of anomaliness are signaled to the user and are not used to train the classifier. An experimental evaluation on real-world data sets shows that the method is able to discover anomalous examples in the data. The main advantage of the proposed method is its ability to report the context of an anomaly and explain why it occurs.
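
The scoring idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the `Rule` class, the `process` function, the Laplace-smoothed frequency estimate, the averaging of `1 - P(value | rule)` over attributes, and the `threshold` and `warmup` parameters are hypothetical and are not the authors' implementation.

```python
from collections import Counter, defaultdict


class Rule:
    """Hypothetical rule: a conjunction of attribute == value tests
    plus per-attribute value counts for the examples it was trained on."""

    def __init__(self, conditions):
        self.conditions = conditions        # e.g. {"protocol": "tcp"}
        self.counts = defaultdict(Counter)  # attribute -> Counter of values
        self.n = 0                          # examples used for training

    def covers(self, example):
        return all(example.get(a) == v for a, v in self.conditions.items())

    def anomaly_score(self, example):
        """Assumed score: mean of 1 - P(value | rule) over the attributes
        that are not part of the rule's conditional part."""
        attrs = [a for a in example if a not in self.conditions]
        if self.n == 0 or not attrs:
            return 0.0
        total = 0.0
        for a in attrs:
            seen = self.counts[a]
            k = len(seen) + 1  # distinct values seen, plus one for unseen
            p = (seen[example[a]] + 1) / (self.n + k)  # Laplace smoothing
            total += 1.0 - p
        return total / len(attrs)

    def update(self, example):
        for a, v in example.items():
            if a not in self.conditions:
                self.counts[a][v] += 1
        self.n += 1


def process(rule, example, threshold=0.9, warmup=30):
    """Return True if the example is flagged as anomalous.
    Flagged examples are reported and skipped; others train the rule."""
    if not rule.covers(example):
        return False
    if rule.n >= warmup and rule.anomaly_score(example) > threshold:
        return True
    rule.update(example)
    return False
```

In this sketch, the rule's conditions serve as the context: a flagged example can be reported together with the rule that covers it and the attribute values that were improbable under that rule.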
