Discovering Restricted Regular Expressions with Interleaving

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on learning an unordered schema from a given set of XML examples, which amounts to learning a restricted regular expression with interleaving from positive example strings. Schemas with interleaving can capture meaningful knowledge that previous inference techniques cannot disclose. Inferring a minimal schema with interleaving is challenging, however: the problem of finding one is shown to be NP-hard. We therefore develop an approximation algorithm and a heuristic solution, using techniques that differ from known inference algorithms. Experiments on real-world data sets demonstrate the effectiveness of our approaches, and the heuristic algorithm is shown to produce results that are very close to optimal.
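
As an informal illustration of the interleaving operator that such expressions rely on, the following minimal sketch enumerates every way of merging two strings while preserving the internal order of each. It is not one of the algorithms developed in the paper; the function name `interleavings` and the sample strings are assumptions chosen only for this example.

```python
from functools import lru_cache


def interleavings(u: str, v: str) -> set:
    """Return all strings that merge u and v while keeping each one's own order."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> frozenset:
        # i and j are the numbers of symbols already consumed from u and v.
        if i == len(u) and j == len(v):
            return frozenset({""})
        out = set()
        if i < len(u):
            # Take the next symbol from u.
            out |= {u[i] + w for w in go(i + 1, j)}
        if j < len(v):
            # Take the next symbol from v.
            out |= {v[j] + w for w in go(i, j + 1)}
        return frozenset(out)

    return set(go(0, 0))


# Both orders of two sibling elements are accepted by an unordered schema.
print(sorted(interleavings("a", "b")))
# "c" may appear anywhere, but "a" must still precede "b".
print(sorted(interleavings("ab", "c")))
```

The number of interleavings grows quickly with string length, which gives some intuition for why a schema that permits interleaving can describe many orderings compactly.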
