An overview of semistructured data

Research on semistructured data started from the observation that much of today's electronic data does not conform to traditional relational or object-oriented data models. Several applications store their data in non-standard data formats: legacy systems, structured documents like HTML or SGML, etc. Another instance is the integration of heterogeneous data sources: often these sources belong to external organizations or partners, not under the application's control, and their structure is only partially known and may change without notice. Data in these applications could be modeled as object-oriented data, but its structure is irregular: some objects may have missing attributes, others may have multiple occurrences of the same attribute, the same attribute may have different types in different objects, and semantically related information may be represented differently in various objects. Data with these characteristics has been called semistructured data.

Recent research has aimed at extending database management techniques to semistructured data. The result of this work is a new paradigm in databases, complementing the relational and object-oriented ones. The new model has been applied to a few research prototypes, such as data integration [PGMW95, PAGM96], Web site management [FFK+98], general-purpose management of semistructured data [QRS+95, AQM+97, MAG+97], and data conversion [SCJK98]. Other research work has addressed query language design [AQM+97, BDHS96a, PAGM96, FFLS97a, FFLS97b], schema specification [BDFS97, MS99, DGM98, BM99], schema extraction [NUWC97, BDFS97, GW97, NAM97], optimizations [AV97b, Suc96, Suc97, FS98, MW97], indexing [MWA+98, MS99], as well as formal aspects [AV97a, AV97b, MM97, FLS98]. The interested reader may want to consult the tutorials by Abiteboul [Abi97] and Buneman [Bun97].

In the traditional relational paradigm a database is modeled as a finite first-order structure. The first-order vocabulary, i.e. the names and arities of the relations of that structure, models the database schema (the list of all table names and their attributes), and first-order logic, or higher-order logic, formulae model queries. Database theory is related to both finite model theory and descriptive complexity. It makes no sense, however, to talk about finite structures or formulae without fixing the vocabulary first. In semistructured databases we do not have an a priori schema, hence no vocabulary. One way to model semistructured databases is to represent them as graphs. Thus, the vocabulary is fixed once and forever to be that of graphs. The real schema of a given database instance is now
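As an illustration of the graph representation mentioned above, the following is a minimal Python sketch, not taken from any of the cited systems. It assumes a rooted, edge-labeled graph in which labels play the role of attribute names and leaves carry atomic values. The class `Graph`, its methods, the node identifiers, and the sample data are all hypothetical names chosen for this example.

```python
from collections import defaultdict


class Graph:
    """Hypothetical edge-labeled graph: nodes are object ids, labels are attribute names."""

    def __init__(self):
        self.edges = defaultdict(list)  # node id -> list of (label, target id)
        self.values = {}                # leaf id -> atomic value

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def add_value(self, source, label, value):
        # Create a fresh leaf holding an atomic value (assumes ids "v0", "v1", ... are unused).
        leaf = f"v{len(self.values)}"
        self.values[leaf] = value
        self.add_edge(source, label, leaf)

    def follow(self, nodes, label):
        # All targets reachable from any of `nodes` over one edge with the given label.
        return [t for n in nodes for (l, t) in self.edges.get(n, []) if l == label]

    def path(self, start, labels):
        # Evaluate a fixed sequence of labels; return atomic values where available.
        nodes = [start]
        for label in labels:
            nodes = self.follow(nodes, label)
        return [self.values.get(n, n) for n in nodes]


db = Graph()
db.add_edge("root", "person", "p1")
db.add_edge("root", "person", "p2")

# p1 has two occurrences of "phone" and a string-valued "age".
db.add_value("p1", "name", "Alice")
db.add_value("p1", "phone", "555-0100")
db.add_value("p1", "phone", "555-0101")
db.add_value("p1", "age", "thirty")

# p2 has no "phone" at all and an integer-valued "age".
db.add_value("p2", "name", "Bob")
db.add_value("p2", "age", 42)

names = db.path("root", ["person", "name"])
phones = db.path("root", ["person", "phone"])
ages = db.path("root", ["person", "age"])

# The labels in use are not declared anywhere; they can only be read off the instance.
labels = sorted({l for out in db.edges.values() for (l, _) in out})
```

The two `person` objects show the irregularities listed earlier: a missing attribute, a repeated attribute, and one attribute with different types. None of this requires a declared set of relation names, since the only fixed structure in the sketch is the graph itself.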

[1]  Dan Suciu,et al.  Adding Structure to Unstructured Data , 1997, ICDT.

[2]  Dan Suciu,et al.  A query language and optimization techniques for unstructured data , 1996, SIGMOD '96.

[3]  Jennifer Widom,et al.  Query Optimization for Semistructured Data , 1997 .

[4]  Dan Suciu,et al.  Optimizing regular path expressions using graph schemas , 1998, Proceedings 14th International Conference on Data Engineering.

[5]  Dan Suciu,et al.  Comprehension syntax , 1994, SGMD.

[6]  Dan Suciu,et al.  A Query Language and Processor for a Web-Site Management System , 1997 .

[7]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[8]  Dan Suciu,et al.  Query containment for conjunctive queries with regular expressions , 1998, PODS.

[9]  Limsoon Wong,et al.  Naturally Embedded Query Languages , 1992, ICDT.

[10]  Jeffrey D. Ullman,et al.  Representative objects: concise representations of semistructured, hierarchical data , 1997, Proceedings 13th International Conference on Data Engineering.

[11]  Serge Abiteboul,et al.  Querying Semi-Structured Data , 1997, Encyclopedia of Database Systems.

[12]  Dan Suciu,et al.  A query language for a Web-site management system , 1997, SGMD.

[13]  Robert E. Tarjan,et al.  Three Partition Refinement Algorithms , 1987, SIAM J. Comput..

[14]  Catriel Beeri,et al.  Schemas for Integration and Translation of Structured and Semi-structured Data , 1999, ICDT.

[15]  Serge Abiteboul,et al.  Inferring structure in semistructured data , 1997, SGMD.

[16]  Jennifer Widom,et al.  Object exchange across heterogeneous information sources , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[17]  Serge Abiteboul,et al.  Regular path queries with constraints , 1997, PODS '97.

[18]  Neil Immerman,et al.  Languages that Capture Complexity Classes , 1987, SIAM J. Comput..

[19]  Alberto O. Mendelzon,et al.  Formal models of Web queries , 1997, Inf. Syst..

[20]  Jennifer Widom,et al.  The Lorel query language for semistructured data , 1997, International Journal on Digital Libraries.

[21]  Dan Suciu,et al.  Index Structures for Path Expressions , 1999, ICDT.

[22]  Jennifer Widom,et al.  Indexing Semistructured Data , 1998 .

[23]  Diego Calvanese,et al.  What can Knowledge Representation do for Semi-Structured Data? , 1998, AAAI/IAAI.

[24]  Dan Suciu,et al.  Catching the boat with Strudel: experiences with a Web-site management system , 1998, SIGMOD '98.

[25]  Robin Milner,et al.  Communication and concurrency , 1989, PHI Series in computer science.

[26]  Jeffrey D. Ullman.  Principles of database and knowledge-base systems , 1989 .

[27]  Masatoshi Yoshikawa,et al.  ILOG: Declarative Creation and Manipulation of Object Identifiers , 1990, VLDB.

[28]  Yannis Papakonstantinou,et al.  Object Fusion in Mediator Systems , 1996, VLDB.

[29]  Roy Goldman,et al.  Lore: a database management system for semistructured data , 1997, SGMD.

[30]  Serge Abiteboul,et al.  Queries and computation on the web , 1997, Theor. Comput. Sci..

[31]  Sophie Cluet,et al.  Your mediators need data conversion! , 1998, SIGMOD '98.

[32]  Dan Suciu,et al.  Query Decomposition and View Maintenance for Query Languages for Unstructured Data , 1996, VLDB.

[33]  Dan Suciu,et al.  Programming Constructs for Unstructured Data , 1995, DBPL.