Adding Structure to Unstructured Data

We develop a new notion of schema for unstructured data. Traditional schemas resemble the type systems of programming languages. For unstructured data, however, the underlying type may be much less constrained, so an alternative way of expressing constraints on the data is needed. We propose that both data and schema be represented as edge-labeled graphs. We develop notions of conformance between a graph database and a graph schema, and show that there is a natural and efficiently computable ordering on graph schemas. We then examine certain subclasses of schemas and show that schemas are closed under query applications. Finally, we discuss how schemas may be used in query decomposition and optimization.