XML is the de facto standard format for data exchange on the Web. While generating XML data is fairly simple, designing a schema and then guaranteeing that the generated data is valid against it is a complex task. As a consequence, much XML data either has no schema or is not accompanied by one. To gain the benefits of a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), the schema must be extracted from the data. In this paper we present XStruct, an automatic technique for XML Schema extraction. Building on the ideas of [5], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and that describe each element’s content correctly while generalizing to a reasonable degree. Our approach offers several advantages over known techniques: XStruct scales to very large documents (beyond 1 GB) in both time and memory consumption; it extracts a general, complete, correct, minimal, and understandable schema for multiple documents; and it detects datatypes and attributes. Experiments confirm these features and properties.
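To make the 1-unambiguity requirement concrete, the following minimal sketch tests whether a content model is 1-unambiguous. It is not XStruct's implementation; it assumes the standard characterization associated with [5], namely that an expression is 1-unambiguous exactly when its position (Glushkov) automaton is deterministic. The tuple-based expression encoding and the helper names (`sym`, `cat`, `alt`, `star`, `opt`, `is_one_unambiguous`) are hypothetical and chosen only for illustration.

```python
from itertools import count

# Hypothetical constructors for content models over element names.
def sym(name): return ("sym", name)
def cat(a, b): return ("cat", a, b)
def alt(a, b): return ("alt", a, b)
def star(a): return ("star", a)
def opt(a): return ("opt", a)

def is_one_unambiguous(expr):
    """Return True if the position automaton of expr is deterministic."""
    labels = {}   # position -> element name at that position
    follow = {}   # position -> set of positions that may come next
    ids = count()

    def walk(node):
        # Returns (nullable, first, last) and fills labels/follow.
        kind = node[0]
        if kind == "sym":
            p = next(ids)
            labels[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "cat":
            n1, f1, l1 = walk(node[1])
            n2, f2, l2 = walk(node[2])
            for p in l1:              # the end of the left part leads into the right part
                follow[p] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return n1 and n2, first, last
        if kind == "alt":
            n1, f1, l1 = walk(node[1])
            n2, f2, l2 = walk(node[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind in ("star", "opt"):
            _, f, l = walk(node[1])
            if kind == "star":        # repetition loops back to the start
                for p in l:
                    follow[p] |= f
            return True, f, l
        raise ValueError(f"unknown node kind: {kind}")

    _, first, _ = walk(expr)

    def distinct_names(positions):
        names = [labels[p] for p in positions]
        return len(names) == len(set(names))

    # Deterministic: no two competing positions carry the same element name.
    return distinct_names(first) and all(distinct_names(s) for s in follow.values())

a, b = sym("a"), sym("b")
ambiguous = cat(star(alt(a, b)), a)                           # (a|b)*a
unambiguous = cat(cat(star(b), a), star(cat(star(b), a)))     # b*a(b*a)*
print(is_one_unambiguous(ambiguous), is_one_unambiguous(unambiguous))
```

The two example expressions describe the same language, yet only the second passes the check, which illustrates why an extractor may need to restructure a candidate expression rather than merely generalize it.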
[1] Kyuseok Shim, et al. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 2004.
[2] Serge Abiteboul, et al. Extracting schema from semistructured data. SIGMOD '98, 1998.
[3] Kam-Fai Wong, et al. Approximate Graph Schema Extraction for Semi-Structured Data. EDBT, 2000.
[4] Boris Chidlovskii. Schema Extraction from XML: A Grammatical Inference Approach. KRDB, 2001.
[5] Derick Wood, et al. One-Unambiguous Regular Languages. Inf. Comput., 1998.
[6] Chin-Wan Chung, et al. Efficient extraction of schemas for XML documents. Inf. Process. Lett., 2003.