Structure-aware XML Object Identification

The object identification problem is particularly hard for XML data, due to its structural flexibility. Tree edit distances have been proposed for approximate comparisons among XML trees. However, such distances ignore the semantics implicit in XML data structure, and their use is computationally infeasible for unordered data. In this paper, we define a new distance for XML data, the structure-aware XML distance, that overcomes these issues, together with a polynomial-time algorithm to calculate it, and we present experimental results that demonstrate its effectiveness and efficiency.
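To make the unordered-comparison issue concrete, the following is a minimal sketch of a toy top-down distance between unordered XML trees. It is not the structure-aware XML distance defined in the paper. The function names (`size`, `unordered_distance`), the unit costs, and the sample documents are all assumptions chosen for illustration. Children are matched by brute-force enumeration, which is exponential in the number of children and is only meant to show why sibling order should not matter.

```python
from itertools import permutations
import xml.etree.ElementTree as ET


def size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node)


def unordered_distance(a, b):
    """Toy distance (illustrative assumption, not the paper's definition).

    Cost 1 if the two roots differ in tag or text, plus the cheapest
    one-to-one matching of their children, where every unmatched child
    subtree costs its size. Sibling order is ignored.
    """
    same_root = a.tag == b.tag and (a.text or "").strip() == (b.text or "").strip()
    root_cost = 0 if same_root else 1

    kids_a, kids_b = list(a), list(b)
    if len(kids_a) > len(kids_b):
        # The distance is symmetric, so keep the shorter list on the left.
        kids_a, kids_b = kids_b, kids_a

    best = None
    # Brute force: try every injective assignment of kids_a into kids_b.
    for perm in permutations(range(len(kids_b)), len(kids_a)):
        matched = sum(unordered_distance(x, kids_b[j]) for x, j in zip(kids_a, perm))
        unmatched = sum(size(kids_b[j]) for j in range(len(kids_b)) if j not in perm)
        total = matched + unmatched
        if best is None or total < best:
            best = total
    return root_cost + best


# Hypothetical sample documents describing the same or a similar person.
t1 = ET.fromstring("<person><name>Ann</name><city>Rome</city></person>")
t2 = ET.fromstring("<person><city>Rome</city><name>Ann</name></person>")
t3 = ET.fromstring(
    "<person><name>Ann</name><city>Milan</city><phone>123</phone></person>"
)

print(unordered_distance(t1, t2))  # same children in a different order
print(unordered_distance(t1, t3))  # one changed value, one extra child
```

In this sketch, `t1` and `t2` differ only in sibling order and are therefore at distance 0, while `t3` differs in one value and has one extra element. A practical implementation would replace the permutation loop with a polynomial-time assignment solver; whether and how the paper does this is not stated in the abstract.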
