A first approach to the automatic recognition of structural patterns in XML documents

XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation and literary works. Sometimes publishers adopt established and well-known vocabularies such as DocBook and TEI; at other times they create partially or entirely new ones that better meet the particular requirements of their documents. The explicit and implicit usage requirements of these vocabularies often follow well-established patterns, creating meta-structures (the block, the container, the inline element, etc.) that persist across vocabularies and authors and that describe a truer and more general conceptualization of the documents' building blocks. Addressing such meta-structures not only gives better insight into what documents are really composed of, but also provides abstract and more general mechanisms for working on documents regardless of the availability of specific schemas, tools and presentation stylesheets. In this paper we introduce a schema-independent theory based on eleven structural patterns. We define these patterns and show how they synthesize characteristics emerging from real markup documents. Additionally, we propose an algorithm that identifies the pattern of each element in a set of homogeneous markup documents.
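To give a rough feel for what schema-independent classification of elements can look like, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that an element's usage can be summarized by two observations gathered over a set of documents: whether any instance of the element directly contains text, and whether any instance contains child elements. The function names (`collect_content_profiles`, `coarse_category`, `classify`) and the four coarse labels are hypothetical and chosen for illustration only; they do not correspond to the eleven patterns of the theory.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_content_profiles(documents):
    """Record, per element name, whether any instance holds text or child elements.

    `documents` is an iterable of XML strings (an assumption of this sketch).
    """
    profiles = defaultdict(lambda: {"text": False, "children": False})
    for xml_source in documents:
        root = ET.fromstring(xml_source)
        for element in root.iter():
            profile = profiles[element.tag]
            # Direct text is the element's leading text plus the tails of its
            # children; whitespace-only text is ignored.
            has_text = bool((element.text or "").strip()) or any(
                (child.tail or "").strip() for child in element
            )
            profile["text"] |= has_text
            profile["children"] |= len(element) > 0
    return dict(profiles)


def coarse_category(profile):
    """Map a content profile to one of four illustrative (hypothetical) labels."""
    if profile["text"] and profile["children"]:
        return "mixed"
    if profile["text"]:
        return "text-only"
    if profile["children"]:
        return "element-only"
    return "empty"


def classify(documents):
    """Assign a coarse category to every element name seen in the documents."""
    profiles = collect_content_profiles(documents)
    return {name: coarse_category(profile) for name, profile in profiles.items()}


if __name__ == "__main__":
    sample = "<doc><sec><p>Hello <b>world</b>!</p><br/></sec></doc>"
    for name, category in classify([sample]).items():
        print(name, category)
```

Because the profiles are accumulated over all documents, an element is labeled by the union of its observed uses, which is why the sketch presupposes a reasonably homogeneous document set. No schema is consulted at any point; only the instances themselves are inspected.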
