Superimposed Code-Based Indexing Method for Extracting MCTs from XML Documents

With the exponential increase in the amount of XML data on the Internet, information retrieval techniques for tree-structured XML documents, such as keyword search, have become important. The results of a keyword search are often represented by minimum connecting trees (MCTs) rooted at the lowest common ancestors (LCAs) of the nodes that together contain all the search keywords. Recently, effective methods have been proposed, such as the stack-based algorithm for generating the lowest grouped distance MCTs (GDMCTs), which yield a more compact representation of the query results. However, these methods remain expensive when the XML documents and the number of search keywords become large. To extract MCTs, and the lowest GDMCTs in particular, more efficiently, we first consider two straightforward LCA detection methods: keyword B+trees with Dewey-order labels and superimposed code-based indexing. We then propose a method that combines these two indexing methods to detect LCAs efficiently. We also present an effective solution to the false drop problem caused by superimposed codes. Finally, we apply the proposed LCA detection methods to generate the lowest GDMCTs. Detailed experiments show that the proposed combined method completely solves the false drop problem and outperforms the stack-based algorithm in extracting the lowest GDMCTs.
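The two building blocks named above can be illustrated with a minimal Python sketch. It is not the paper's index or algorithm. It assumes Dewey labels are stored as integer tuples, that a keyword signature sets a few hash-chosen bits in a fixed-width word, and that each node's code is the bitwise OR of the signatures found in its subtree. All function names and the parameters `width` and `bits` are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

Dewey = Tuple[int, ...]  # assumed encoding, e.g. (1, 2, 1) for label "1.2.1"


def dewey_lca(labels: Sequence[Dewey]) -> Dewey:
    """The LCA's Dewey label is the longest common prefix of the labels."""
    prefix = labels[0]
    for label in labels[1:]:
        n = 0
        while n < min(len(prefix), len(label)) and prefix[n] == label[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def keyword_code(keyword: str, width: int = 64, bits: int = 3) -> int:
    """Hypothetical signature: set up to `bits` hash-chosen positions."""
    code = 0
    for i in range(bits):
        digest = hashlib.sha256(f"{i}:{keyword}".encode()).digest()
        code |= 1 << (int.from_bytes(digest[:4], "big") % width)
    return code


def subtree_codes(keywords_at: Dict[Dewey, List[str]]) -> Dict[Dewey, int]:
    """Superimpose (bitwise OR) each node's keyword codes onto the node
    itself and onto every ancestor, identified by the label's prefixes."""
    codes: Dict[Dewey, int] = {}
    for label, words in keywords_at.items():
        node_code = 0
        for word in words:
            node_code |= keyword_code(word)
        for depth in range(1, len(label) + 1):
            ancestor = label[:depth]
            codes[ancestor] = codes.get(ancestor, 0) | node_code
    return codes


def may_contain_all(code: int, query: Sequence[str]) -> bool:
    """Signature test: never misses a true match, but may report a
    false drop when unrelated keywords happen to set the same bits."""
    query_code = 0
    for word in query:
        query_code |= keyword_code(word)
    return code & query_code == query_code


def contains_all(keywords_at: Dict[Dewey, List[str]], root: Dewey,
                 query: Sequence[str]) -> bool:
    """Exact check over the subtree of `root`, used to filter false drops."""
    found = set()
    for label, words in keywords_at.items():
        if label[:len(root)] == root:
            found.update(words)
    return set(query) <= found
```

In this sketch a candidate subtree is kept only if `may_contain_all` passes and `contains_all` then confirms it. The exact check is the simplest way to discard false drops and stands in for, rather than reproduces, the solution proposed in the paper.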
