From relations to XML: cleaning, integrating and securing data

While relational databases remain the preferred approach for storing data, XML is emerging as the primary standard for representing and exchanging data. Consequently, it has become increasingly important to provide a uniform XML interface to various data sources (integration), and critical to protect sensitive and confidential information in XML data (access control). Moreover, it is preferable to first detect and repair inconsistencies in the data, so that errors do not propagate to later data processing steps. In response to these challenges, this thesis presents an integrated framework for cleaning, integrating and securing data. The framework consists of three parts. First, the data cleaning sub-framework uses a new class of constraints designed specifically for improving data quality, referred to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in relational data. Both batch and incremental techniques are developed for efficiently detecting CFD violations with SQL and for repairing them based on a cost model. The cleaned relational data, together with other non-XML data, is then converted to XML by widely deployed XML publishing facilities. Second, the data integration sub-framework uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML data that is either native or published from traditional databases. XIGs automatically support conformance to a target DTD and allow one to build a large, complex integration by composing component XIGs. To materialize the integrated data efficiently, algorithms are developed for merging the XML queries in XIGs and for scheduling them. Third, to protect sensitive information in the integrated XML data, the data security sub-framework allows users to access the data only through authorized views. User queries posed on these views need to be rewritten into equivalent queries on the underlying document, to avoid the prohibitive cost of materializing and maintaining a large number of views. Two algorithms are proposed to support virtual XML views: a rewriting algorithm that characterizes the rewritten queries as a new form of automata, and an evaluation algorithm that executes the automata-represented queries. Together they allow the security sub-framework to answer queries on views in linear time. Using both relational and XML technologies, the framework provides a uniform approach to cleaning, integrating and securing data. The algorithms and techniques in the framework have been implemented, and an experimental study verifies their effectiveness and efficiency.
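To make the idea of detecting CFD violations with SQL more concrete, the following is a minimal sketch and not the thesis's actual queries. It assumes a hypothetical relation cust(id, cc, zip, street), one embedded dependency [cc, zip] -> [street], and a pattern tableau stored in a hypothetical table tp in which '_' acts as a wildcard. One query looks for single tuples that contradict a constant pattern; the other looks for groups of tuples that match a wildcard pattern yet disagree on street. All table names, column names and sample rows are illustrative assumptions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cust (id INTEGER PRIMARY KEY, cc TEXT, zip TEXT, street TEXT);
        -- Pattern tableau for the CFD [cc, zip] -> [street]; '_' is a wildcard.
        CREATE TABLE tp (cc TEXT, zip TEXT, street TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO cust VALUES (?, ?, ?, ?)",
        [
            (1, "44", "EH4 1DT", "Mayfield"),
            (2, "44", "EH4 1DT", "Crichton"),  # same cc and zip as id 1, other street
            (3, "01", "07974", "Tree Ave"),
            (4, "01", "07974", "Elm Str"),     # contradicts the constant pattern
        ],
    )
    conn.executemany(
        "INSERT INTO tp VALUES (?, ?, ?)",
        [
            ("44", "_", "_"),             # when cc = 44, zip determines street
            ("01", "07974", "Tree Ave"),  # constant pattern (assumed for illustration)
        ],
    )
    return conn


def find_cfd_violations(conn):
    """Return (ids violating a constant pattern, (cc, zip) groups in conflict)."""
    # Single-tuple check: the tuple matches the pattern's left-hand side,
    # the pattern fixes a constant street, and the tuple's street differs.
    const_rows = conn.execute(
        """
        SELECT DISTINCT t.id
        FROM cust t, tp p
        WHERE (p.cc = '_' OR t.cc = p.cc)
          AND (p.zip = '_' OR t.zip = p.zip)
          AND p.street <> '_' AND t.street <> p.street
        ORDER BY t.id
        """
    ).fetchall()
    # Multi-tuple check: tuples that match a wildcard-street pattern and
    # agree on (cc, zip) must not carry more than one distinct street.
    var_rows = conn.execute(
        """
        SELECT t.cc, t.zip
        FROM cust t, tp p
        WHERE (p.cc = '_' OR t.cc = p.cc)
          AND (p.zip = '_' OR t.zip = p.zip)
          AND p.street = '_'
        GROUP BY t.cc, t.zip
        HAVING COUNT(DISTINCT t.street) > 1
        ORDER BY t.cc, t.zip
        """
    ).fetchall()
    return [row[0] for row in const_rows], var_rows


if __name__ == "__main__":
    constant_violations, group_violations = find_cfd_violations(build_demo_db())
    print("tuples violating a constant pattern:", constant_violations)
    print("(cc, zip) groups with conflicting streets:", group_violations)
```

Keeping the patterns in a table rather than hard-coding them in the query means the same two queries can be reused when patterns are added or removed; whether the thesis's batch and incremental detection techniques are organized this way is not stated in the abstract.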
