Estimating Data Integration and Cleaning Effort

Data cleaning and data integration have been the subject of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some, and in most cases significant, human input for configuration, during processing, and for evaluation. For managers (and for developers and scientists), it would therefore be of great value to be able to estimate the effort of cleaning and integrating given data sets and to know the pitfalls of such an integration project in advance. Such estimates help in deciding whether to undertake an integration project through cost/benefit analysis, in budgeting a team with funds and manpower, and in monitoring its progress. Further, knowing how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for automatic effort estimation for mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and, ultimately, effort, taking into account heterogeneities of both schemas and instances and covering both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the duration of an integration process, and that it provides a meaningful breakdown of the integration problems as well as the required integration activities.
