How to adjust an ensemble size in stream data mining?

In this paper we propose a new approach to designing an ensemble for stream data classification. The approach rests on two theorems that determine whether a new component should be added to the ensemble, under the requirement that the addition increase the accuracy of the ensemble not only on the current portion of observations but also on the whole (infinite) data stream. The conclusions of these theorems hold with a probability (confidence) set by the user. Computer simulations show, among other results, that lowering this confidence, i.e. the confidence that a decision based on a finite portion of the stream agrees with the decision that would be made on the whole (infinite) stream, improves accuracy only slightly while significantly increasing memory consumption. Moreover, we introduce a novel procedure for weighting ensemble components, here decision trees, by assigning a weight to each leaf of the tree, whereas previous approaches assigned a single weight to the whole component. The new procedure is based on the observation that the probability of a correct outcome differs across sections of a tree.
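
The two ideas above can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not state the two theorems, so the sketch substitutes a standard one-sided Hoeffding bound (valid only under an i.i.d. assumption on the chunk) for the decision rule. The function names `hoeffding_epsilon`, `should_add_component` and `leaf_weighted_vote`, the per-sample 0/1 correctness lists, and the `(label, leaf_weight)` representation of a tree's output are all hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def hoeffding_epsilon(n, delta, value_range=2.0):
    """One-sided Hoeffding deviation for the mean of n bounded i.i.d. values.

    value_range is the width of the interval the values live in; the
    per-sample accuracy difference used below lies in [-1, 1], so it is 2.
    """
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def should_add_component(correct_with, correct_without, delta):
    """Decide whether a candidate component should join the ensemble.

    correct_with / correct_without: per-sample 0/1 flags saying whether the
    ensemble classified the sample correctly with / without the candidate.
    Returns True when the observed accuracy gain on this chunk exceeds the
    Hoeffding deviation, so that (assumption: i.i.d. samples) the gain is
    positive on the whole stream with probability at least 1 - delta.
    """
    if len(correct_with) != len(correct_without) or not correct_with:
        raise ValueError("need two non-empty lists of equal length")
    n = len(correct_with)
    gain = sum(w - wo for w, wo in zip(correct_with, correct_without)) / n
    return gain > hoeffding_epsilon(n, delta)


def leaf_weighted_vote(leaf_outputs):
    """Combine tree outputs when each leaf carries its own weight.

    leaf_outputs: one (label, leaf_weight) pair per tree, where leaf_weight
    is the weight of the leaf the sample reached (for example, that leaf's
    estimated accuracy). Returns the label with the largest total weight.
    """
    totals = defaultdict(float)
    for label, weight in leaf_outputs:
        totals[label] += weight
    return max(totals, key=totals.get)
```

In this sketch a lower confidence (a larger `delta`) shrinks the threshold, so candidates are accepted more readily and the ensemble grows, which matches the memory trade-off described above. A call such as `leaf_weighted_vote([("a", 0.9), ("b", 0.6), ("b", 0.55)])` lets two moderately reliable leaves outvote one strong leaf, something a single per-tree weight cannot express when a tree's reliability varies between leaves.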
