Efficient evaluation of XML queries

The contributions in this thesis focus on processing XML queries using an algebra and on exploiting optimization opportunities to enhance execution performance. Work on algebraic native XQuery implementations has focused on efficient evaluation of XPath expressions; little has been done on efficient evaluation of queries as a whole. As our first contribution, we introduce the Generalized Tree Pattern (GTP) as a concise representation of XQuery expressions. Evaluating a query then reduces to finding matches for its GTP. Using this idea, we develop efficient evaluation plans that significantly outperform competing approaches.

A bulk algebra requires manipulating sets of objects that are structurally homogeneous, a requirement that is at odds with the nature of XML query processing. We address this problem by introducing Annotated Pattern Trees and Tree Logical Classes. We show that it is possible to define bulk operations on structurally heterogeneous sets of trees by inducing homogeneity through a tree logical class reduction. We define a Tree Logical Class (TLC) algebra and demonstrate its utility in evaluating XQuery. We show that TLC produces better-performing evaluation plans than competing tree algebra techniques.

XML and XQuery semantics are order sensitive. The order is determined by an elaborate set of explicit and implicit parameters. Determining the correct output order, as well as the order required by each operator, while optimizing the placement of SORTs is a non-trivial procedure that can significantly affect query evaluation performance. Our solution uses Hybrid Collections annotated with Ordering and Duplicate Specifications. We show how to produce the correct output order while still allowing algebraic rewrites that enhance performance.

XML is commonly treated as schema-less so that evaluation techniques can be executed in the absence of a schema. We instead assume that schema knowledge exists and explore using it as a performance-enhancing tool. For our last contribution, we present practical structures for storing metadata knowledge, the Schema Information Graph (SIG) and the Alternate Paths, together with algorithms that take advantage of them within the constraints of an optimizer. Optimized plans are shown to significantly outperform 'naive' ones.