An empirical analysis of feature engineering for predictive modeling

Machine learning models such as neural networks, decision trees, random forests, and gradient boosting machines accept a feature vector and produce a prediction. These models learn in a supervised fashion from a set of feature vectors paired with their expected outputs. It is common practice to engineer new features from the provided feature set; such engineered features either augment or replace portions of the existing feature vector. Engineered features are essentially calculated fields derived from the values of other features. Engineering them is primarily a manual, time-consuming task, and each type of model responds differently to different types of engineered features. This paper reports on empirical research into which types of engineered features are best suited to which machine learning model type. Several datasets are generated, each designed to benefit from a particular type of engineered feature, and the experiment measures the degree to which each machine learning model can synthesize the needed feature on its own. If a model can synthesize an engineered feature, there is no need to provide that feature explicitly. The research demonstrated that the studied models do indeed perform differently with various types of engineered features.
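To make the experimental design concrete, here is a minimal illustrative sketch (not the paper's actual code) in Python with scikit-learn: a synthetic regression target is built from the ratio of two raw features, and a random forest is trained once on the raw features alone and once with the ratio supplied as an engineered feature. The dataset size, model choice, and hyperparameters are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic dataset whose target depends on a ratio of two raw features.
n = 5000
X_raw = rng.uniform(1.0, 10.0, size=(n, 2))
y = X_raw[:, 0] / X_raw[:, 1]

# Engineered feature: the ratio itself, appended to the raw features.
ratio = (X_raw[:, 0] / X_raw[:, 1]).reshape(-1, 1)
X_eng = np.hstack([X_raw, ratio])

for name, X in [("raw features only", X_raw), ("with ratio feature", X_eng)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:.4f}")
```

A small gap between the two error scores suggests the model can approximate the ratio internally; a large gap suggests the engineered feature is worth providing explicitly, which is the comparison the experiment makes for each feature type and model.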
