The Document Components Ontology (DoCO)

Machine-readable descriptions of document structure and of document discourse (e.g. the scientific discourse within scholarly articles) are crucial for facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements for describing both structural and rhetorical document components in RDF. In addition to presenting the formal description of the ontology, this paper showcases its practical utility in a variety of our own applications and in other activities of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.
