NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data

We present Noah, an ongoing research project that aims to develop a system for semi-automatically creating end-to-end Web data processing pipelines. These pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same type of question can be posed interchangeably to human crowd workers and to automatic responders based on machine learning (ML) models. From the early stages of a pipeline, crowd workers are engaged both to guarantee the quality of the output data and to collect training data, which is then used to progressively train and evaluate the automatic responders. The responders are later fully deployed into the pipelines to scale the approach and to contain crowdsourcing costs. By combining guaranteed output quality with progressively decreasing costs, the pipelines generated by our system can improve the return on investment and the development processes of the many applications that depend on such data.
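The core routing idea described above can be sketched in a few lines. This is only an illustrative outline, not the actual Noah implementation: the class name, the confidence threshold, and the placeholder `model_predict` and `ask_crowd` functions are all hypothetical. A question is answered automatically when the ML responder is confident enough; otherwise it falls back to crowd workers, whose answers accumulate as training data for the model.

```python
# Illustrative sketch (hypothetical names and threshold): a hybrid responder
# routes a question to an ML model when the model is confident, and falls
# back to crowd workers otherwise; crowd answers become training data.
from dataclasses import dataclass, field

@dataclass
class HybridResponder:
    threshold: float = 0.9            # confidence required to trust the model
    training_data: list = field(default_factory=list)

    def model_predict(self, question):
        # Placeholder ML responder returning (answer, confidence).
        # A real system would query a progressively trained model here.
        return None, 0.0

    def ask_crowd(self, question):
        # Placeholder crowd interface returning a human-provided answer.
        return f"crowd answer to: {question}"

    def answer(self, question):
        prediction, confidence = self.model_predict(question)
        if confidence >= self.threshold:
            return prediction             # cheap, automatic path
        label = self.ask_crowd(question)  # costly, quality-guaranteed path
        self.training_data.append((question, label))  # collect training data
        return label

responder = HybridResponder()
print(responder.answer("Do these two pages describe the same product?"))
```

As the model improves on the collected labels, more questions clear the confidence threshold and crowdsourcing costs shrink, which matches the cost trajectory the abstract describes.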
