A Relational Framework for Classifier Engineering

In the design of analytical procedures and machine-learning solutions, feature engineering is a critical and time-consuming task, for which various recipes and tooling approaches have been developed. We set out to establish database foundations for feature engineering. Specifically, we propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques that assist developers with feature engineering by utilizing the database's modeling and understanding of data and queries, and by deploying the well-studied principles of database management. We demonstrate the usefulness of the framework by formally defining key algorithmic challenges and presenting preliminary complexity results.
