A new generation of data processing systems, including web search, Google’s Knowledge Graph, IBM’s Watson, and several different recommendation systems, combine rich databases with software driven by machine learning. The spectacular successes of these trained systems have been among the most notable in all of computing and have generated excitement in health care, finance, energy, and general business. But building them can be challenging, even for computer scientists with PhD-level training. If these systems are to have a truly broad impact, building them must become easier. We explore one crucial pain point in the construction of trained systems: feature engineering. Given the sheer size of modern datasets, feature developers must (1) write code with few effective clues about how their code will interact with the data and (2) repeatedly endure long system waits even though their code typically changes little from run to run. We propose brainwash, a vision for a feature engineering data system that could dramatically ease the Explore-Extract-Evaluate interaction loop that characterizes many trained system projects.