Web retrieval of XML documents: practice and challenges

The Web is characterized by a huge number of highly heterogeneous data sources that differ in both media support and format representation. In this scenario, an integrated approach to querying heterogeneous Web documents is needed. To this end, XML can play an important role, since it is becoming a standard for data representation and exchange over the Web. Owing to its flexibility, XML is currently used as an interface language over the Web, through which document sources (or parts of them) are represented and exported. Under this assumption, the problem of querying heterogeneous sources reduces to the problem of querying XML data sources. In this chapter, we first survey the most relevant query languages for XML data proposed by both the scientific community and standardization committees such as the W3C, focusing mainly on their expressive power. We then investigate how typical Information Retrieval concepts, such as ranking, similarity-based search, and profile-based search, can be applied to XML query languages. Commercial products based on the considered approaches are then briefly surveyed. Finally, we conclude the chapter with an overview of the most promising research trends in the field.
