Declarative XML Data Cleaning with XClean

Data cleaning is the process of correcting anomalies in a data source, that may for instance be due to typographical errors, or duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration.With the growing amount of XML data, approaches to effectively and efficiently clean XML are needed, an issue not addressed by existing data cleaning systems that mostly specialize on relational data. We present XClean, a data cleaning framework specifically geared towards cleaning XML data. XClean's approach is based on a set of cleaning operators, whose semantics is well-defined in terms of XML algebraic operators. Users may specify cleaning programs by combining operators by means of a declarative XClean/PL program, which is then compiled into XQuery. We describe XClean's operators, language, and compilation approach, and validate its effectiveness through a series of case studies.

[1]  Pedro M. Domingos,et al.  Object Identification with Attribute-Mediated Dependences , 2005, PKDD.

[2]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[3]  Kyuseok Shim,et al.  Query Optimization in the Presence of Foreign Functions , 1993, VLDB.

[4]  Torsten Grust,et al.  Advances in database technology - EDBT 2006 : 10th International Conference on Extending Database Technology, Munich, Germany, March 2006; proceedings , 2006 .

[5]  Felix Naumann,et al.  DogmatiX tracks down duplicates in XML , 2005, SIGMOD '05.

[6]  Felix Naumann,et al.  Detecting Duplicates in Complex XML Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[7]  Felix Naumann,et al.  Declarative Data Fusion - Syntax, Semantics, and Implementation , 2005, ADBIS.

[8]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[9]  Jennifer Widom,et al.  Trio: A System for Integrated Management of Data, Accuracy, and Lineage , 2004, CIDR.

[10]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[11]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[12]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[13]  Felix Naumann,et al.  Automatic Data Fusion with HumMer , 2005, VLDB.

[14]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[15]  Jeffrey F. Naughton,et al.  Query execution techniques for caching expensive methods , 1996, SIGMOD '96.

[16]  Felix Naumann,et al.  XML Duplicate Detection Using Sorted Neighborhoods , 2006, EDBT.

[17]  Hamid Pirahesh,et al.  Extending XQuery for analytics , 2005, SIGMOD '05.

[18]  Daniela Florescu,et al.  XML query processing , 2004, Proceedings. 20th International Conference on Data Engineering.

[19]  Kaizhong Zhang,et al.  Exact and approximate algorithms for unordered tree matching , 1994, IEEE Trans. Syst. Man Cybern..

[20]  Luís Torgo,et al.  Knowledge Discovery in Databases: PKDD 2005, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings , 2005, PKDD.

[21]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[22]  Elio Masciari,et al.  Fast detection of XML structural similarity , 2005, IEEE Transactions on Knowledge and Data Engineering.

[23]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[24]  W. Winkler Overview of Record Linkage and Current Research Directions , 2006 .

[25]  Renée J. Miller,et al.  ConQuer: efficient management of inconsistent databases , 2005, SIGMOD '05.

[26]  Yannis Papakonstantinou,et al.  Object Fusion in Mediator Systems , 1996, VLDB.

[27]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[28]  Michael J. Carey,et al.  Programming with XQuery , 2006, XIME-P.

[29]  Surajit Chaudhuri,et al.  Data cleaning in microsoft SQL server 2005 , 2005, SIGMOD '05.

[30]  Stéphane Bressan,et al.  Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web , 2003, Lecture Notes in Computer Science.