Feature bundling in decision tree algorithm

In empirical data modelling, a model of a system is built from a set of cases that the system has observed. Ultimately, the performance of the induced model is dominated by the quality and quantity of those observations. Feature transformation methods are widely used to improve the quality of the knowledge extracted from observations, so that more accurate and robust models can be built. In this paper, a new feature transformation method for decision tree algorithms, named dynamical feature bundling, is proposed. Dynamical feature bundling groups a set of features during the tree induction phase, and it enables decision tree algorithms to 1) use the features in one bundle together to make collective judgments in the splitting phase; 2) learn more reliable and stable knowledge from feature bundles created from the domain knowledge of experts; and 3) embed the feature transformation step into the tree induction phase, so that the extra pre-processing step required by static feature transformation methods becomes unnecessary. Our experiments show 2%–9% improvements in AUC on a very imbalanced dataset. Slight improvements are also obtained on a more balanced dataset.
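To make the idea of a collective judgment at split time more concrete, the following is a minimal sketch rather than the method described in the paper. The abstract does not say how the features in a bundle are combined, so the sketch assumes, purely for illustration, that a bundle is scored by the mean of its feature values and that candidate thresholds on that score are ranked by Gini impurity decrease. The names `gini`, `bundle_score` and `best_bundle_split` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a sequence of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def bundle_score(row: Sequence[float], bundle: Sequence[int]) -> float:
    """Collective score of one sample over the features in a bundle.

    Assumption: the bundle is summarised by the mean of its feature values.
    The paper may combine bundled features differently.
    """
    return sum(row[j] for j in bundle) / len(bundle)


def best_bundle_split(
    X: List[Sequence[float]], y: Sequence[int], bundle: Sequence[int]
) -> Tuple[Optional[float], float]:
    """Return (threshold, impurity_decrease) of the best split on the bundle score.

    The score is computed inside the split search, so no separate
    pre-processing pass over the data is needed.
    """
    scores = [bundle_score(row, bundle) for row in X]
    parent = gini(y)
    n = len(y)
    best_threshold: Optional[float] = None
    best_gain = 0.0
    candidates = sorted(set(scores))
    # Try the midpoint between each pair of adjacent distinct scores.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0
        left = [label for s, label in zip(scores, y) if s <= t]
        right = [label for s, label in zip(scores, y) if s > t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_threshold, best_gain = t, gain
    return best_threshold, best_gain


if __name__ == "__main__":
    # Toy data: features 0 and 1 form a bundle, feature 2 is unrelated noise.
    X = [[0.1, 0.2, 5.0], [0.2, 0.1, 1.0], [0.8, 0.9, 4.0], [0.9, 0.7, 2.0]]
    y = [0, 0, 1, 1]
    print(best_bundle_split(X, y, bundle=[0, 1]))
```

In a full tree learner, a search of this kind would run at every node for each bundle alongside the usual single-feature candidates, and the split with the largest impurity decrease would be chosen.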
