Measuring similarity between collection of values

In this paper, we propose a set of similarity metrics for manipulating collections of values occurring in XML documents. Following the data model presented in the TAX algebra, we treat an XML element as a labeled, ordered, rooted tree. Considering that XML nodes can be either atomic, i.e., they contain single values such as short character strings, dates, etc., or complex, i.e., nested structures that contain other nodes, we propose two types of similarity metrics: MAVs, for atomic nodes, and MCVs, for complex nodes. In the first case, we suggest the use of several metrics that depend on the application domain. In the second case, we define metrics for complex values that are structure dependent and can be applied distinctly to tuples and to collections of values. We also present experiments showing the effectiveness of our method.
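
The split between atomic and complex nodes can be illustrated with a minimal sketch. It is not the paper's definition of MAVs or MCVs: the function names `atomic_similarity` and `node_similarity`, the use of a generic string ratio for atomic values, and the best-match averaging over children are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def atomic_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for an atomic-value metric: a string ratio in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def node_similarity(x: ET.Element, y: ET.Element) -> float:
    """Hypothetical structure-dependent score for two XML elements, in [0, 1]."""
    # Assumption: elements with different labels are never similar.
    if x.tag != y.tag:
        return 0.0

    xs, ys = list(x), list(y)

    # Atomic case: neither element has children, so compare the text values.
    if not xs and not ys:
        return atomic_similarity(x.text or "", y.text or "")

    # One side is atomic and the other is complex: treated as dissimilar here.
    if not xs or not ys:
        return 0.0

    # Complex case: pair each child of x with its best-scoring child of y,
    # then normalize by the larger child count so extra children lower the score.
    total = sum(max(node_similarity(cx, cy) for cy in ys) for cx in xs)
    return total / max(len(xs), len(ys))


if __name__ == "__main__":
    a = ET.fromstring("<author><name>J. Smith</name><year>2003</year></author>")
    b = ET.fromstring("<author><name>John Smith</name><year>2003</year></author>")
    print(node_similarity(a, b))
```

The best-match pairing is a simplification: it ignores child order and is not guaranteed to be symmetric when the two elements have different numbers of children. A domain-specific metric (for dates or person names, for example) would replace `atomic_similarity` in practice.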
