Semantic annotation of semi-structured documents

This paper proposes a novel method for semantic annotation of semi-structured documents using GATE (General Architecture for Text Engineering), one of the most widely used and powerful annotation tools. GATE is designed to annotate plain text and to perform natural language processing (NLP) on it, so it does not treat the markup or formatting information of a semi-structured document as structure to be preserved. Depending on the document loading options ("markup aware" or not), it either annotates the whole document including the markup, or extracts only the text and destroys the original document structure. Neither behavior is acceptable when annotations must be saved back into the original documents in popular formats such as Microsoft Word and Excel. The solution proposed in this paper allows annotations to be saved in the original documents without destroying their contents or formatting information. The method is particularly important for semantically enriching semi-structured documents (especially Microsoft Word and Excel), because it allows specific pieces of information inside these documents, rather than the document as a whole, to be linked to ontological information such as ontology instances, without disturbing the original content.
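
The general idea of annotating text while leaving the surrounding markup intact can be illustrated with a small sketch. It is not the method proposed in the paper and does not use the GATE API. It is a toy example in Python, assuming a simple XML-like string as input rather than a real Word or Excel file. The function names `extract_text` and `annotate`, the `<ann ref="...">` tag, and the instance identifier are all hypothetical. The sketch extracts the plain text, records where each character came from in the original markup, and writes annotations back at those positions.

```python
import re

# Toy tag pattern: assumes well-formed tags and no '>' inside attribute values.
TAG = re.compile(r"<[^>]+>")

def extract_text(markup):
    """Return (text, offsets); offsets[i] is the index in `markup` of text[i]."""
    chars, offsets = [], []
    pos = 0
    for m in TAG.finditer(markup):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, m.start()):
            chars.append(markup[i])
            offsets.append(i)
        pos = m.end()
    # Copy any text after the last tag.
    for i in range(pos, len(markup)):
        chars.append(markup[i])
        offsets.append(i)
    return "".join(chars), offsets

def annotate(markup, term, instance_ref):
    """Wrap each occurrence of `term` in a hypothetical <ann> tag.

    Matching is done on the extracted plain text; the result is written
    back into the original markup so existing tags are left untouched.
    """
    if not term:
        return markup
    text, offsets = extract_text(markup)
    spans = []
    for m in re.finditer(re.escape(term), text):
        start = offsets[m.start()]
        end = offsets[m.end() - 1] + 1
        # Skip matches interrupted by a tag: wrapping them would
        # produce badly nested markup (a limitation of this sketch).
        if end - start != len(term):
            continue
        spans.append((start, end))
    out = markup
    # Insert from the end so earlier offsets remain valid.
    for start, end in reversed(spans):
        out = (out[:start] + f'<ann ref="{instance_ref}">'
               + out[start:end] + "</ann>" + out[end:])
    return out

if __name__ == "__main__":
    doc = "<p><b>GATE</b> annotates text. GATE is modular.</p>"
    print(annotate(doc, "GATE", "onto#GATE"))
```

The sketch ignores character entities, matches that span several formatting runs, and the packaging of real office formats, all of which a practical solution would have to handle.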