Learning Schemas for Unordered XML

We consider unordered XML, where the relative order among siblings is ignored, and we investigate the problem of learning schemas from examples given by the user. We focus on the schema formalisms proposed in [10]: disjunctive multiplicity schemas (DMS) and their restriction, disjunction-free multiplicity schemas (MS). A learning algorithm takes as input a set of XML documents that must satisfy the schema (i.e., positive examples) and a set of XML documents that must not satisfy the schema (i.e., negative examples), and returns a schema consistent with the examples. We investigate a learning framework inspired by Gold [18], where a learning algorithm should be sound, i.e., always return a schema consistent with the examples given by the user, and complete, i.e., able to produce every schema given a sufficiently rich set of examples. Additionally, the algorithm should be efficient, i.e., polynomial in the size of the input. We prove that DMS are learnable from positive examples only, but that they are not learnable when negative examples are also allowed. Moreover, we show that MS are learnable from positive examples only, and also from both positive and negative examples. Furthermore, for the learnable cases, the proposed learning algorithms return minimal schemas consistent with the examples.
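To make the positive-examples setting concrete, the following is a minimal sketch of how a disjunction-free schema could be inferred from positive examples. It is an illustration, not the algorithm of the paper. It assumes that an MS assigns to each parent label one multiplicity per child label, drawn from 0, 1, ?, +, *, and that a child label missing from a rule has multiplicity 0. The names `multiplicity`, `learn_ms`, and `satisfies` and the dictionary encoding of schemas are hypothetical.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def multiplicity(lo, hi):
    """Tightest multiplicity among 0, 1, ?, +, * that covers every count in [lo, hi]."""
    if hi == 0:
        return "0"
    if lo >= 1:
        return "1" if hi == 1 else "+"
    return "?" if hi == 1 else "*"


def learn_ms(documents):
    """Return (root label, {parent label: {child label: multiplicity}}).

    Assumption: child labels never observed under a parent are simply
    absent from its rule, which stands for multiplicity 0.
    """
    roots = [ET.fromstring(d) for d in documents]
    root_labels = {r.tag for r in roots}
    if len(root_labels) != 1:
        raise ValueError("positive examples must share a single root label")
    # For every label, collect one Counter of child labels per node with that label.
    observed = defaultdict(list)
    for r in roots:
        for node in r.iter():
            observed[node.tag].append(Counter(child.tag for child in node))
    schema = {}
    for parent, counters in observed.items():
        children = set().union(*counters)
        schema[parent] = {
            c: multiplicity(min(k[c] for k in counters), max(k[c] for k in counters))
            for c in sorted(children)
        }
    return root_labels.pop(), schema


def satisfies(document, root, schema):
    """Check a document against a schema in the encoding produced by learn_ms."""
    allowed = {
        "0": lambda n: n == 0,
        "1": lambda n: n == 1,
        "?": lambda n: n <= 1,
        "+": lambda n: n >= 1,
        "*": lambda n: True,
    }
    tree = ET.fromstring(document)
    if tree.tag != root:
        return False
    for node in tree.iter():
        rule = schema.get(node.tag)
        if rule is None:
            return False
        counts = Counter(child.tag for child in node)
        if any(c not in rule for c in counts):
            return False
        if not all(allowed[m](counts[c]) for c, m in rule.items()):
            return False
    return True


positives = ["<r><a/><b/><b/></r>", "<r><a/><a/></r>"]
root, schema = learn_ms(positives)
```

Under these assumptions, `schema["r"]` is `{"a": "+", "b": "*"}`: every example has at least one `a` under `r`, while `b` occurs zero or two times. The sketch runs in time polynomial in the input size and, by construction, accepts every positive example, which corresponds to the soundness and efficiency requirements described above. It says nothing about completeness or about handling negative examples.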

[1]  Dana Angluin, et al. Inference of Reversible Languages, 1982, JACM.

[2]  Dana Angluin, et al. Inductive Inference of Formal Languages from Positive Data, 1980, Inf. Control.

[3]  Serge Abiteboul, et al. Highly expressive query languages for unordered data trees, 2012, ICDT.

[4]  Pekka Kilpeläinen, et al. One-unambiguity of regular expressions with numeric occurrence indicators, 2007, Inf. Comput.

[5]  E. Mark Gold, et al. Language Identification in the Limit, 1967, Inf. Control.

[6]  Enrique Vidal, et al. Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[7]  Derick Wood, et al. One-Unambiguous Regular Languages, 1998, Inf. Comput.

[8]  W. Marsden. I and J, 2012.

[9]  Colin de la Higuera. Characteristic Sets for Polynomial Grammatical Inference, 1997.

[10]  Sebastian Maneth, et al. XML compression via DAGs, 2013, ICDT '13.

[11]  Umesh V. Vazirani, et al. An Introduction to Computational Learning Theory, 1994.

[12]  Frank Neven, et al. Inferring XML Schema Definitions from XML Data, 2007, VLDB.

[13]  Felix Naumann, et al. XStruct: Efficient Schema Extraction from Multiple and Large XML Documents, 2006, 22nd International Conference on Data Engineering Workshops (ICDEW'06).

[14]  Kyuseok Shim, et al. XTRACT: Learning Document Type Descriptors from XML Document Collections, 2004, Data Mining and Knowledge Discovery.

[15]  Iovka Boneva, et al. Simple Schemas for Unordered XML, 2013, WebDB.

[16]  Chin-Wan Chung, et al. Efficient extraction of schemas for XML documents, 2003, Inf. Process. Lett.

[17]  Maarten Marx, et al. The quality of the XML Web, 2013, J. Web Semant.

[18]  Slawomir Staworko, et al. Learning twig and path queries, 2012, ICDT '12.

[19]  Thomas Schwentick, et al. Inference of concise regular expressions and DTDs, 2010, TODS.

[20]  Frank Neven, et al. Learning deterministic regular expressions for the inference of schemas from XML data, 2008, WWW.

[21]  Frank Neven, et al. DTDs versus XML schema: a practical study, 2004, WebDB '04.

[22]  Christos H. Papadimitriou, et al. Computational complexity, 1993.

[23]  Daniela Florescu. Managing Semi-Structured Data, 2005, ACM Queue.

[24]  Laks V. S. Lakshmanan, et al. Tree pattern query minimization, 2002, The VLDB Journal.

[25]  Thomas Schwentick, et al. Inference of concise DTDs from XML data, 2006, VLDB.

[26]  Boris Chidlovskii. Schema Extraction from XML: A Grammatical Inference Approach, 2001, KRDB.