Toward a System Building Agenda for Data Integration (and Data Science)

In this paper we argue that, to truly advance the field, the data management community should devote far more effort to building data integration (DI) systems. Toward this goal, we make three contributions. First, we draw on our recent industrial experience to discuss the limitations of current DI systems. Second, we propose an agenda for building a new kind of DI system that addresses these limitations. Such systems guide users through the DI workflow, step by step. They provide tools that address the "pain points" of each step, and these tools are built on top of the Python data science and Big Data ecosystem (PyData). We discuss how to foster an ecosystem of such tools within PyData, then use it to build DI systems for collaborative, cloud, crowd, and lay-user settings. Finally, we discuss ongoing work at Wisconsin, which suggests that these DI systems are highly promising and that building them raises many interesting research challenges.
