The Simplest Query Language That Could Possibly Work

The INEX’03 query language proved to be much too complicated for the INEX participants to use well, let alone anyone else. We need something simpler, but not too simple. Something which is basically a hybrid between Boolean IR queries and a stripped-down CSS will do the job.

1. INEX NEEDS A QUERY LANGUAGE.

In the INEX conferences, we are trying to develop a data collection and a set of queries with known answers that can provide a solid basis for research and experimentation with XML information retrieval. In order to communicate between researchers in the same year, we need a common query language. For INEX’02 there was such a language. In INEX’03 there was another. In order to communicate between the researchers who produce the queries in one year and the researchers who use them in later years, we need a stable, well-defined language.

The designer(s) of the INEX’03 query language had every reason to feel pleased. After the INEX’02 query language proved to need revision, surely this was the simplest thing that could possibly work: take an extremely well established XML structural query language (XPath) and add to it a minimal set of features for Information Retrieval.

It seems to be agreed that XPath is not a language for the casual user. But this paper is not concerned with user query languages. The query language we need is a query language for use by researchers who are expert in information retrieval and XML. What counts is whether the query language is suitable for us, not users.

Unfortunately, the production of this year’s CAS queries proved conclusively that the INEX’03 query language is far too complicated for us:

• It proved too hard to use. Of the 30 CAS queries that were selected, 19 (nearly 2/3) were either syntactically illegal or otherwise wrong. It took no fewer than 12 rounds of correction before we had a completed collection of queries.
• Like many W3C productions, XPath 1.0 is quirky, to put it kindly.
It is very powerful in some respects, but there are queries that are very hard to express. For example, //body//ip1//name | //body//ip2//name is legal, but //body//(ip1|ip2)//name is not.
• It proved to be hard to implement. Presumably everyone who submitted a query for consideration had already checked it with some XML IR engine; how else could they have known that the query had about the right number of relevant answers? Yet a large number of queries were syntactically or semantically wrong. That should have been noticed. At least one implementor switched the semantics of the / and // operators.
• It proved to be hard to implement for another reason. XPath is quite powerful, in ways that are not likely to be useful for information retrieval, and yet if XPath was not implemented in full, were we really implementing the INEX’03 query language? This year, it turned out that most of the power of XPath was not needed. It wasn’t the simplest thing that could possibly have worked. For example, we[23] found that there were 198,041 nodes in the index after ignoring “noise” tags. Yet if ordinal position was also ignored, there were only 10,522 distinct paths. Not one of this year’s selected CAS queries used the ordinal position ([n]) feature of XPath.
• XPath has a clear definition of the “string value” of a node; the definition is precise, but given the actual XML markup in the document collection we are working with, it’s not the definition we want. For example, if there is one mention of Joe Bloggs in the collection, as 〈au〉〈fnm〉Joe〈/fnm〉〈snm〉Bloggs〈/snm〉〈/au〉, then the string value is “JoeBloggs” and a search for the word “Bloggs” is guaranteed to miss it. Worse, markup that is supposed to enclose numbers very commonly includes punctuation as well; the rules of XPath say that trying to convert such a string value to numeric form is an error. Yet we want to query it.

2. THE INEX’03 QUERY LANGUAGE WAS TOO HARD TO USE.

Every group had to submit 3 CAS and 3 CO queries.
These submissions were supposed to have been tested, and known to have a reasonable number (not too high, not too low) of relevant answers. In fact, some answers were provided with each submission. So each submitted query should have been a legal INEX’03 query.

From this pool, 30 CAS and 36 CO queries were selected. Of the 30 CAS queries, 19 had either syntax errors or serious semantic errors. The most common semantic error was using the “child” operator / when the “descendant” operator // was intended. This is a shocking error rate. It wasn’t just hard to get the queries right in the first place; it was hard to fix them. It took 12 rounds of corrections before we had a workable set of queries, starting from what were presumably the best queries in the first place.

Since a query language based on XPath 1.0 was too hard for us to use, it is impossible to believe that a query language based on the much more complicated XPath 2.0 could be usable by us.

3. WHAT SHOULD WE LOOK FOR IN A QUERY LANGUAGE?

3.1 We want something WE can use.

This paper is not about query interfaces or query languages for end users. This paper is solely concerned with query languages for researchers producing or using INEX data. Complexity is not necessarily a problem for us, as long as it is useful complexity. Requiring an intimate knowledge of XML or XML related technologies is not necessarily a problem for us. Requiring lots of punctuation in just the right places is not necessarily a problem for us.

While complexity need not be a problem, we need to take a step back and start with something much simpler than XPath, because it is an empirically established fact that it was too complicated for us. It is not likely that the query language we propose in this paper will serve for all time; what does matter is that it should be possible to automatically translate it into whatever richer language may be devised in the future. Simplicity now means easier conversion in the future.
So one guiding rule is that nothing should be included in the query language unless it was actually used in this year’s or last year’s queries.

We do not want to limit INEX participation to experimenters following an “orthodox line” in query languages. Keeping the query language simple keeps the conference open to approaches with as yet unimagined index structures and retrieval techniques. XPath and XPath-like languages penalise such approaches.

3.2 Databases and information retrieval are different.

It is useful to distinguish between database query languages and information retrieval query languages. They have some similarities, but the differences are fundamental, and mean that an XML database query language is unlikely to be a good foundation for an XML information retrieval query language.

The CODASYL database language, “network” databases, the relational algebra, the relational calculus, SQL, the Object Query Language (OQL) in the ODMG Object Database Standard[4], and various spatial and temporal extensions of relational databases, even the Smalltalk dialect used in Gemstone, all have these fundamental characteristics in common:

• To a large extent, as [9] puts it, this “data is primarily intended for computer, not human, consumption.”
• A “database” is made up of elementary values (numbers, strings, dates, and so on) aggregated using a predefined set of container types with precise data structure semantics and labelled with user-defined labels (column names, relation names, and so on).
• The user-defined labels have user-defined semantics which the database is aware of only to the extent that constraints are stated.
• Even when there are user-defined structures (classes in ODMG, Gemstone, and SQL3, for example), these may be seen as instances of one of a fixed set of meta-structures.
For example, the ODMG standard provides an Object Interchange Format by means of which any object database may be dumped as a text stream; instances of classes all have a fixed format here and it is clear that “class” is a single meta-structure.
• There is a structured query language with a (more-or-less) formal definition which relates any legal query to a precise semantics, by appealing to the data structure semantics of the container types and meta-structures and to any stated constraints.
• A query processor is expected to obey the semantics of any query it accepts precisely; it may exploit known properties of the query language to transform a query into one with better performance, typically by using indexes.
• If a query has more than one answer, all of the answers are relevant. Someone who doesn’t want all of the answers is expected to write a more specific query.

Database query languages are just like programming languages. (Very bad programming languages, some of them, notably SQL.) The person formulating the query is expected to understand the relevant user-defined labels and constraints and to “program” a query which expresses his or her needs. A database query engine is required to obey the query literally, just as a C compiler is required to translate C faithfully, even rubbish. If you ask an ODMG database the OQL query select p from Persons p where p.address.city = "Dunedin" and the answer includes a p for which p.address.city = "Mosgiel", you will be seriously unhappy, even though Mosgiel is only 10 to 15 minutes’ drive from Dunedin.

Since SGML was designed, the SGML slogan has been “a document is a database”. For many years there have been SGML document database engines, notably SIM[16]. As XML is a special case of SGML, it is natural to view an XML document as a database.

• The elementary values are strings. The aggregates are labelled attributed tree structures. The data structure semantics is provided by GROVEs, or the DOM.
Element type names and attribute names are the user-defined labels.
• Constraints are stated by means of DTDs or XML Schemas. XML Schemas in particular express the notion “a database is a document”.

What you get, on that view, is a database query language for tree-stru