Highly Heterogeneous XML Collections: How to Retrieve Precise Results?

Highly heterogeneous XML collections are thematic collections whose documents use different structures: parent-child and ancestor-descendant relationships are not preserved from one document to another, and element names can differ in vocabulary. In this setting, current approaches return answers with low precision. We present an approach, based on similarity measures and semantic inverted indices, that improves the precision of query answers without compromising performance.
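To make the setting concrete, the following is a minimal sketch of an inverted index keyed by concept rather than by raw element name. It is not the index described in the paper, whose details are not given here. The synonym table `SYNONYMS`, the helper names `concept`, `build_index` and `docs_with_all`, and the three sample documents are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical synonym table (an assumption, not from the paper):
# maps differing element names onto one shared concept label.
SYNONYMS = {"writer": "author", "volume": "book"}


def concept(tag):
    """Return the shared concept label for an element name."""
    return SYNONYMS.get(tag, tag)


def build_index(docs):
    """Map each concept to postings of (doc_id, root-to-node concept path)."""
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        stack = [(root, (concept(root.tag),))]
        while stack:
            node, path = stack.pop()
            index[path[-1]].append((doc_id, path))
            for child in node:
                stack.append((child, path + (concept(child.tag),)))
    return index


def docs_with_all(index, concepts):
    """Ids of documents containing every concept, whatever their nesting."""
    sets = [{doc_id for doc_id, _ in index.get(c, [])} for c in concepts]
    return set.intersection(*sets) if sets else set()


# Toy collection: d2 inverts the book/author nesting and uses other names.
DOCS = {
    "d1": "<book><title>XML</title><author>Ann</author></book>",
    "d2": "<writer><volume><title>Trees</title></volume></writer>",
    "d3": "<book><title>Only title</title></book>",
}
INDEX = build_index(DOCS)
```

In this toy collection, a query for the concepts `book` and `author` matches both `d1` and `d2`, even though `d2` nests the book inside the author and uses the names `writer` and `volume`. An exact path query such as `/book/author` would match only `d1`. The stored concept paths are what a similarity measure could then compare to rank the candidates, which is where precision would be gained or lost.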
