XML Structured and Unstructured Query Processing

XML has become a standard for representing, storing, and exchanging data, so retrieving data from XML sources effectively and efficiently is increasingly important. In this thesis, we study two main types of XML queries: structured queries and keyword search.

For structured queries, we focus on twig pattern matching, which lies at the core of most XML query languages (e.g., XPath, XQuery). Existing twig pattern matching algorithms can be classified into two-phase and one-phase algorithms. We first propose two novel one-phase holistic twig matching algorithms, TwigMix and TwigFast, which combine the efficient selection of useful elements (introduced in TwigStack) with the simple lists for storing final solutions (introduced in TwigList). TwigMix introduces the element selection function getNext of TwigStack into TwigList to avoid manipulating useless elements in the stack and lists. TwigFast improves on this by adding pointers to the lists, which avoids the use of stacks entirely. In addition, previous twig pattern matching algorithms may incur other redundant computation, so we propose two approaches, re-test checking and forward-to-end, which reduce this redundant computation and can easily be applied to both holistic one-phase and two-phase algorithms.

Improving the effectiveness of XML keyword search remains an open problem. In this thesis, we first present XKMis, which divides the nodes into meaningful, self-contained information segments, called minimal information segments (MISs), and returns MIS-subtrees, which consist of MISs that are logically connected by the keywords. These MIS-subtrees are closer to what the user wants. XReal [1] uses statistics of the underlying data to resolve keyword ambiguity problems. However, we found that its formula for inferring the search-for node type suffers from inconsistency and abnormality problems. We therefore propose a dynamic reduction factor scheme and a novel algorithm, DynamicInfer, to resolve these two problems. Next, we resolve keyword ambiguities by exploiting users' typing habits when constructing keyword queries. We propose an approach that infers and ranks a set of likely search intentions, where each keyword in a search intention has a specific meaning. The result subtrees of the inferred search intentions are returned to users in clusters, which can significantly reduce users' browsing time. Finally, we explore the application of query suggestion in XML keyword search and propose a novel interactive XML query system, XQSuggest, which mainly targets non-professional users who roughly know the contents of the database.

In summary, this thesis presents several novel algorithms that improve the efficiency of twig pattern matching, and several approaches that resolve keyword ambiguity and improve the effectiveness of XML keyword search.

Statement of Originality

This work has not previously been submitted for a degree or diploma in any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

[1]  Ziv Bar-Yossef,et al.  The Space Complexity of Processing XML Twig Queries Over Indexed Documents , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Qin Iris Wang,et al.  Learning Noun Phrase Query Segmentation , 2007, EMNLP.

[3]  Ioana Manolescu,et al.  The XML benchmark project , 2001 .

[4]  Jiaheng Lu,et al.  Effective Keyword Search in XML Documents Based on MIU , 2006, DASFAA.

[5]  Xiaofeng Meng,et al.  On the sequencing of tree structures for XML indexing , 2005, 21st International Conference on Data Engineering (ICDE'05).

[6]  Quanzhong Li,et al.  Indexing and Querying XML Data for Regular Path Expressions , 2001, VLDB.

[7]  Luis Gravano,et al.  Efficient IR-Style Keyword Search over Relational Databases , 2003, VLDB.

[8]  Tok Wang Ling,et al.  From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching , 2005, VLDB.

[9]  Hongjun Lu,et al.  Efficient Processing of Twig Queries with OR-Predicates , 2004, ACM SIGMOD Conference.

[10]  Aoying Zhou,et al.  Hash-Search: An Efficient SLCA-Based Keyword Search Algorithm on XML Documents , 2009, DASFAA.

[11]  Jianyong Wang,et al.  Effective keyword search for valuable lcas over xml documents , 2007, CIKM '07.

[12]  Yehoshua Sagiv,et al.  XSEarch: A Semantic Search Engine for XML , 2003, VLDB.

[13]  Jianxin Li,et al.  Suggestion of promising result types for XML keyword search , 2010, EDBT '10.

[14]  David J. DeWitt,et al.  On supporting containment queries in relational database management systems , 2001, SIGMOD '01.

[15]  Jianxin Li,et al.  XClean: Providing valid spelling suggestions for XML keyword queries , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[16]  Paul F. Dietz.  Maintaining order in a linked list , 1982, STOC '82.

[17]  Hongjun Lu,et al.  PBiTree coding and efficient processing of containment joins , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[18]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[19]  Jianxin Li,et al.  Fast ELCA computation for keyword queries on XML data , 2010, EDBT '10.

[20]  Susan B. Davidson,et al.  BLAS: an efficient XPath processing system , 2004, SIGMOD '04.

[21]  Wen-Chi Hou,et al.  Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution , 2007, DEXA.

[22]  Tok Wang Ling,et al.  Efficient Processing of Ordered XML Twig Pattern , 2005, DEXA.

[23]  Tok Wang Ling,et al.  TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data , 2006, DASFAA.

[24]  Jignesh M. Patel,et al.  Structural join order selection for XML query optimization , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[25]  Philip S. Yu,et al.  ViST: a dynamic index method for querying XML data by tree structures , 2003, SIGMOD '03.

[26]  Lin Guo.  XRANK: Ranked Keyword Search over XML Documents , 2003 .

[27]  Yannis Papakonstantinou,et al.  Efficient keyword search for smallest LCAs in XML databases , 2005, SIGMOD '05.

[28]  Hongjun Lu,et al.  Holistic Twig Joins on Indexed XML Documents , 2003, VLDB.

[29]  Masatoshi Yoshikawa,et al.  An XML indexing structure with relative region coordinate , 2001, Proceedings 17th International Conference on Data Engineering.

[30]  Jiang Li,et al.  XQSuggest: An Interactive XML Keyword Search System , 2009, DEXA.

[31]  Xiaofeng Meng,et al.  TwigStack+: Holistic twig join pruning using extended solution extension , 2007, Wuhan University Journal of Natural Sciences.

[32]  Bongki Moon,et al.  PRIX: indexing and querying XML using prufer sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[33]  Scott Boag,et al.  XQuery 1.0 : An XML Query Language , 2007 .

[34]  Fuchun Peng,et al.  Unsupervised query segmentation using generative language models and wikipedia , 2008, WWW.

[35]  Divesh Srivastava,et al.  Efficient Handling of Positional Predicates Within XML Query Processing , 2005, XSym.

[36]  S. Sudarshan,et al.  Keyword searching and browsing in databases using BANKS , 2002, Proceedings 18th International Conference on Data Engineering.

[37]  Yi Chen,et al.  eXtract: a snippet generation system for XML search , 2008, Proc. VLDB Endow..

[38]  Toshiyuki Amagasa,et al.  QRS: a robust numbering scheme for XML documents , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[39]  Cong Yu,et al.  Schema-Free XQuery , 2004, VLDB.

[40]  C. M. Sperberg-McQueen,et al.  W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures , 2012 .

[41]  Toshiyuki Amagasa,et al.  XRel: a path-based approach to storage and retrieval of XML documents using relational databases , 2001, ACM Trans. Internet Techn..

[42]  Cong Yu,et al.  TIMBER: A native XML database , 2002, The VLDB Journal.

[43]  Jignesh M. Patel,et al.  Structural joins: a primitive for efficient XML query pattern matching , 2002, Proceedings 18th International Conference on Data Engineering.

[44]  Vagelis Hristidis,et al.  Keyword proximity search on XML graphs , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[45]  Yong Zhang,et al.  Efficient Holistic Twig Joins in Leaf-to-Root Combining with Root-to-Leaf Way , 2007, DASFAA.

[46]  Tok Wang Ling,et al.  PathStack : A Holistic Path Join Algorithm for Path Query with Not-Predicates on XML Data , 2005, DASFAA.

[47]  Mong-Li Lee,et al.  An evaluation of XML indexes for structural join , 2004, SGMD.

[48]  W. Bruce Croft,et al.  Refining Keyword Queries for XML Retrieval by Combining Content and Structure , 2009, ECIR.

[49]  Philip S. Yu,et al.  BLINKS: ranked keyword searches on graphs , 2007, SIGMOD '07.

[50]  Rémi Gilleron,et al.  Retrieving meaningful relaxed tightest fragments for XML keyword search , 2009, EDBT '09.

[51]  Kyoungro Yoon,et al.  Index structures for structured documents , 1996, DL '96.

[52]  Beng Chin Ooi,et al.  XR-tree: indexing XML data for efficient structural joins , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[53]  Kam-Fai Wong,et al.  Fast Structural Join with a Location Function , 2006, DASFAA.

[54]  Yi Chen,et al.  Reasoning and identifying relevant matches for XML keyword search , 2008, Proc. VLDB Endow..

[55]  Yi Chen,et al.  Identifying meaningful return information for XML keyword search , 2007, SIGMOD '07.

[56]  Marianne Winslett,et al.  Effective, design-independent XML keyword search , 2009, CIKM.

[57]  Marcus Fontoura,et al.  Optimizing cursor movement in holistic twig joins , 2005, CIKM '05.

[58]  Chee Yong Chan,et al.  Multiway SLCA-based keyword search in XML data , 2007, WWW '07.

[59]  Yehoshua Sagiv,et al.  Interconnection semantics for keyword search in XML , 2005, CIKM '05.

[60]  Tok Wang Ling,et al.  Effective XML Keyword Search with Relevance Oriented Ranking , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[61]  Tok Wang Ling,et al.  MCN: A New Semantics Towards Effective XML Keyword Search , 2009, DASFAA.

[62]  Yi Chen,et al.  Improving XML search by generating and utilizing informative result snippets , 2010, TODS.

[63]  Hua-Gang Li,et al.  Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents , 2006, VLDB.

[64]  Patrick E. O'Neil,et al.  ORDPATHs: insert-friendly XML node labels , 2004, SIGMOD '04.

[65]  Tok Wang Ling,et al.  On boosting holism in XML twig pattern matching using structural indexing techniques , 2005, SIGMOD '05.

[66]  Jiang Li,et al.  Twig Pattern Matching: A Revisit , 2011, DEXA.

[67]  Jianxin Li,et al.  Top-k keyword search over probabilistic XML data , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[68]  Carlo Zaniolo,et al.  Efficient Structural Joins on Indexed XML Documents , 2002, VLDB.

[69]  Tok Wang Ling,et al.  Exploiting ID References for Effective Keyword Search in XML Documents , 2008, DASFAA.

[70]  Raymond K. Wong,et al.  Structural proximity searching for large collections of semi-structured data , 2001, CIKM '01.

[71]  Daniela Florescu,et al.  Quilt: An XML Query Language for Heterogeneous Data Sources , 2000, WebDB.

[72]  Tok Wang Ling,et al.  Efficient processing of XML twig patterns with parent child edges: a look-ahead approach , 2004, CIKM '04.

[73]  Mong-Li Lee,et al.  A Prime Number Labeling Scheme for Dynamic Ordered XML Trees , 2004, ICDE.

[74]  Marcus Fontoura,et al.  Virtual cursors for XML joins , 2004, CIKM '04.

[75]  Rémi Gilleron,et al.  ValidMatch: Retrieving More Reasonable SLCA-Based Result for XML Keyword Search , 2009, DASFAA.

[76]  Jiang Li,et al.  XKMis: effective and efficient keyword search in XML databases , 2009, IDEAS '09.

[77]  Divesh Srivastava,et al.  Holistic twig joins: optimal XML pattern matching , 2002, SIGMOD '02.

[78]  Divesh Srivastava,et al.  Keyword proximity search in XML trees , 2006 .

[79]  Yehoshua Sagiv,et al.  Keyword proximity search in complex data graphs , 2008, SIGMOD Conference.

[80]  Jeffrey Xu Yu,et al.  TwigList : Make Twig Pattern Matching Fast , 2007, DASFAA.

[81]  Jiang Li,et al.  Fast Matching of Twig Patterns , 2008, DEXA.

[82]  Marianne Winslett,et al.  Using structural information in XML keyword search effectively , 2011, TODS.

[83]  Divesh Srivastava,et al.  Answering order-based queries over XML data , 2005, WWW '05.