Approximating the Schema of a Set of Documents by Means of Resemblance

The WWW contains a huge number of documents. Some of them share the same subject but are produced by different people, or even by different organizations. A semi-structured model makes it possible to share documents that do not have exactly the same structure. However, it does not make such heterogeneous documents easier to understand. In this paper, we offer a characterization of, and an algorithm to obtain, a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. The algorithm is then generalized to deal with repetitions and with different classes of documents. Although an exact representative could always be found by using an unlimited number of optional elements, doing so would cause an overfitting problem, and the size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that users find smaller representatives easier and faster to work with, to the point of compensating for the loss in approximation.
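To make the idea of a representative that maximizes resemblance concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that each document is reduced to a flat set of element names, that resemblance is the Jaccard coefficient (the abstract does not name the function used), and that a simple greedy search is acceptable. The names `jaccard`, `mean_resemblance`, and `approximate_representative` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed resemblance function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_resemblance(rep, docs):
    """Average resemblance between a candidate representative and all documents."""
    return sum(jaccard(rep, d) for d in docs) / len(docs)


def approximate_representative(docs):
    """Greedy sketch: add an element only if it raises the mean resemblance.

    Elements are tried from most to least frequent; ties are broken by name
    so that the result is deterministic. This is a heuristic, so the result
    is an approximation rather than a guaranteed optimum.
    """
    counts = Counter(e for d in docs for e in d)
    candidates = sorted(counts, key=lambda e: (-counts[e], e))
    rep = set()
    best = mean_resemblance(rep, docs)
    for element in candidates:
        score = mean_resemblance(rep | {element}, docs)
        if score > best:
            rep.add(element)
            best = score
    return rep, best


if __name__ == "__main__":
    # Hypothetical documents, each described by its set of element names.
    docs = [
        {"title", "author", "year"},
        {"title", "author", "isbn"},
        {"title", "author", "year", "publisher"},
    ]
    rep, score = approximate_representative(docs)
    print(sorted(rep), score)
```

In this toy input, the rare elements `isbn` and `publisher` are left out because adding either one lowers the mean resemblance. This mirrors the trade-off described above: a representative that listed every element as optional would cover every document, but it would be larger and would fit the sample too closely.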
