Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA

Streaming classification of big data is a stream data mining method that learns from continuous, ordered sequences of data arriving from diverse sources, using limited computing and storage resources. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the very fast decision tree used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as tie-threshold, grace value, confidence, and split criterion. Although VHT is widely accepted as an efficient streaming classifier, one of the challenges in streaming classification is that the distribution of incoming data instances over the underlying classes varies from dataset to dataset, and the performance of VHT varies accordingly. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore a challenging task, and a fixed set of critical parameter values cannot be preconfigured for all types of datasets. This research work explores the capabilities of the VHT streaming classifier of SAMOA in the light of benchmarking performance statistics such as classification accuracy, kappa, and kappa temporal. The work presented here experimentally identifies suitable values of the critical parameters of VHT that yield optimized performance on different datasets. This analytical study is thus significant for developing streaming classifiers that achieve optimum performance via parameter tuning at run time.
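To illustrate how the confidence, tie-threshold, and grace value parameters interact, the following minimal sketch shows the split test used by Hoeffding-tree learners of the very fast decision tree family. It is not the SAMOA implementation: the function and argument names are hypothetical, information gain is assumed as the split criterion, and the grace value is assumed to be the number of instances a leaf accumulates between split attempts.

```python
import math


def hoeffding_bound(value_range, confidence, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range -- R, the range of the split criterion
    confidence  -- delta, the allowed probability of choosing the wrong attribute
    n           -- number of instances observed at the leaf
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


def should_attempt_split(n_seen, n_at_last_check, grace_value):
    """Re-evaluate a leaf only after grace_value new instances have arrived."""
    return n_seen - n_at_last_check >= grace_value


def should_split(best_gain, second_gain, n, num_classes, confidence, tie_threshold):
    """Decide whether a leaf should split on its best attribute.

    Assumes information gain as the criterion, so R = log2(num_classes).
    The leaf splits when the best attribute beats the runner-up by more
    than epsilon, or when epsilon has shrunk below the tie threshold
    (the two candidates are then treated as tied and the best is taken).
    """
    value_range = math.log2(num_classes)
    epsilon = hoeffding_bound(value_range, confidence, n)
    return (best_gain - second_gain > epsilon) or (epsilon < tie_threshold)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    if should_attempt_split(n_seen=200, n_at_last_check=0, grace_value=200):
        print(should_split(0.50, 0.10, n=200, num_classes=2,
                           confidence=1e-7, tie_threshold=0.05))
```

Under these assumptions, a smaller confidence value or a smaller tie-threshold makes the tree split more conservatively, while a larger grace value reduces how often the comparatively expensive split test runs. This is one reason a single fixed configuration is unlikely to suit every dataset.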
