Machine Learning to Data Management: A Round Trip

With the emergence of machine learning (ML) techniques in database research, ML has shown tremendous potential to reshape the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Parts of the data preparation, standardization, and cleaning processes, such as data matching and deduplication, could be automated by having an ML model learn and predict the matches routinely. Data integration can also benefit from ML, since the data to be integrated can be sampled and used to design the integration algorithms. After the initial manual work to set up the labels, ML models can start learning from the new incoming data submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and the more accurate its results can be. ML can therefore scale better than traditional, time-consuming approaches. Nevertheless, many ML algorithms require tuning beyond their out-of-the-box settings, and their parameters and scope are often not adapted to the problem at hand. For example, in cleaning and integration processes, the window sizes of values used by the ML models cannot be chosen arbitrarily and require an adaptation of the learning parameters. This tutorial surveys the recent trend of applying machine learning solutions to improve data management tasks and to establish new paradigms that sharpen data error detection, cleaning, and integration at the data instance level, as well as at the schema, system, and user levels.
