vtreat: a data.frame Processor for Predictive Modeling

We examine common problems in data used for predictive modeling and describe how to address them with the vtreat R package. vtreat prepares real-world data for predictive modeling in a reproducible and statistically sound manner. We describe the theory of preparing variables so that the data has fewer exceptional cases, which makes models safer to use in production. The problems addressed include infinite values, invalid values, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application but not during training). Of special interest are the techniques needed to avoid introducing nested model bias, which is a risk when using a data preprocessor.
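
To make the listed problems concrete, here is a minimal sketch in plain Python of three of the repairs: replacing missing or non-finite numeric values while recording an indicator, pooling rare categorical levels, and encoding levels never seen in training. It is not vtreat's code or API. The function names, the choice of the training mean as the fill value, and the `min_count` threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

# Illustrative helpers only; names and defaults are hypothetical, not vtreat's API.

def fit_numeric(values):
    """Learn a fill value: the mean of the finite, non-missing training values."""
    good = [v for v in values if v is not None and math.isfinite(v)]
    return sum(good) / len(good) if good else 0.0

def apply_numeric(values, fill):
    """Return (cleaned values, is-bad indicators) for a numeric column."""
    cleaned, is_bad = [], []
    for v in values:
        bad = v is None or not math.isfinite(v)
        cleaned.append(fill if bad else float(v))
        is_bad.append(1 if bad else 0)
    return cleaned, is_bad

def fit_levels(values, min_count=2):
    """Keep levels seen at least min_count times in training; rarer ones are pooled."""
    counts = Counter(v for v in values if v is not None)
    return sorted(level for level, n in counts.items() if n >= min_count)

def apply_levels(values, levels):
    """One indicator per kept level; rare, novel, and missing values become all-zero rows."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

# Fit on training data, then apply the same plan to application data.
fill = fit_numeric([1.0, None, 3.0, float("inf")])
new_x, new_x_is_bad = apply_numeric([5.0, float("nan")], fill)

levels = fit_levels(["a", "a", "b", "b", "c"])
new_c = apply_levels(["a", "z", None], levels)
```

The design point is that every parameter (the fill value, the kept levels) is learned from training data only and then applied unchanged to new data, so a value such as `"z"` that first appears at application time cannot break the model. The sketch does not address nested model bias, which arises when a treatment uses the outcome and the same rows are reused to fit the model.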
