Information Retrieval of Sequential Data in Heterogeneous XML Databases

XML is a W3C standard supported by both industry and the scientific community, so the volume of information annotated in XML keeps growing. Its complexity is growing as well: XML documents have evolved from plain structured-text representations into documents with complex, heterogeneous structures and contents, such as sequential or time-series data. In this article we introduce a retrieval scheme designed to manage sequential data in an XML context. It is based on two levels of approximation: one on the structural localization and organization of the sequential data, and one on its content. To this end we merge methods developed in two different research areas: XML information retrieval and sequence similarity search.
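
The idea of two levels of approximation can be illustrated with a minimal sketch. It is not the scheme introduced in the article, whose details this abstract does not give. It assumes that structural closeness is measured by an edit distance over element paths, that content closeness is measured by an edit distance over the sequences themselves, and that the two are blended with a weight `alpha`. The function names, the normalization, and the weight are all hypothetical choices made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def combined_score(query_path, query_seq, doc_path, doc_seq, alpha=0.5):
    """Blend structural and content similarity (alpha is an assumed weight)."""
    structural = similarity(query_path, doc_path)  # where the data sits
    content = similarity(query_seq, doc_seq)       # what the data contains
    return alpha * structural + (1 - alpha) * content


if __name__ == "__main__":
    # Hypothetical query: a sequence close to "ACGT" under protein/sequence.
    score = combined_score(
        ["protein", "sequence"], "ACGT",
        ["entry", "protein", "sequence"], "ACGA",
    )
    print(round(score, 3))
```

In this sketch a document element can still score well when its path differs slightly from the one in the query, or when its sequence is only approximately equal to the query sequence, which is the kind of tolerance needed for heterogeneous collections.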
