Schema Discovery of the Semi-structured and Hierarchical Data

Web data are typically semi-structured and lack explicit external schema information, which makes querying and browsing them inefficient. In this paper, we present an approach for quickly discovering the inherent schema(s) of semi-structured, hierarchical data sources, based on the Object Exchange Model (OEM) and an efficient pruning strategy. The schema discovered by our algorithm takes the form of data path expressions and can easily be transformed into a schema tree.
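The idea of path-expression schemas with pruning can be illustrated by the following minimal sketch. It is not the algorithm of this paper; it rests on several assumptions made purely for illustration: OEM-like objects are modelled as acyclic nested Python dicts and lists, a path counts as part of the schema when at least `min_support` top-level objects contain it, and pruning follows the Apriori-style rule that an infrequent path is never extended. The names `discover_paths`, `to_schema_tree`, and `min_support` are hypothetical.

```python
def children(nodes, label):
    """Follow edges named `label` from each node, flattening list values."""
    out = []
    for node in nodes:
        if isinstance(node, dict) and label in node:
            child = node[label]
            out.extend(child if isinstance(child, list) else [child])
    return out


def discover_paths(objects, min_support):
    """Return label paths contained in at least `min_support` objects.

    Assumption: the data is tree-shaped (no cycles). Paths are grown one
    label at a time; a path below the threshold is pruned and none of its
    extensions are ever explored.
    """
    # frontier maps a frequent path to, per object, the nodes it reaches
    frontier = {(): [[obj] for obj in objects]}
    frequent = []
    while frontier:
        next_frontier = {}
        for path, reached in frontier.items():
            labels = {
                label
                for nodes in reached
                for node in nodes
                if isinstance(node, dict)
                for label in node
            }
            for label in sorted(labels):
                extended = [children(nodes, label) for nodes in reached]
                support = sum(1 for nodes in extended if nodes)
                if support >= min_support:
                    new_path = path + (label,)
                    frequent.append(new_path)
                    next_frontier[new_path] = extended
                # otherwise: pruned, the path is not extended further
        frontier = next_frontier
    return frequent


def to_schema_tree(paths):
    """Merge path expressions into a nested-dict schema tree."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            node = node.setdefault(label, {})
    return tree


# Hypothetical sample data: three loosely structured records.
objects = [
    {"title": "A", "author": {"name": "x", "email": "e"}},
    {"title": "B", "author": [{"name": "y"}, {"name": "z"}]},
    {"title": "C", "year": 1999},
]
paths = discover_paths(objects, min_support=2)
schema = to_schema_tree(paths)
```

In this sketch, raising `min_support` yields a smaller, more general schema, while lowering it retains rarer paths at the cost of exploring more of the data.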
