Taxonomic Corpus-Based Concept Summary Generation for Document Annotation

Semantic annotation is an enabling technology that links documents to concepts that unambiguously describe their content. Annotation improves access to document content for both humans and software agents. However, annotation is challenging because annotators often have to select from thousands of potentially relevant concepts in controlled vocabularies. The best approaches to assisting with this task rely on reusing the annotations of a previously annotated corpus. In the absence of such a corpus, alternative approaches suffer because most vocabularies provide insufficient descriptive text for their concepts. In this paper, we propose an unsupervised method for recommending document annotations that generates node descriptors from an external corpus. We exploit the taxonomic structure of a thesaurus to ensure that effective descriptors (concept summaries) are generated for its concepts. Our evaluation on recommending annotations shows that the generated content effectively represents the concepts. Our approach also outperforms approaches that rely on information from a thesaurus alone and is comparable with supervised approaches.
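The abstract does not give the details of the method, so the following is only a minimal illustrative sketch of the general idea: build a descriptor for each thesaurus concept from sentences of an external corpus, use taxonomic neighbours to favour relevant sentences, and rank concepts against a document. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `build_descriptors` and `recommend`, the toy thesaurus and corpus, the rule that doubles the weight of sentences mentioning a parent or child label, and the TF-IDF cosine ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_descriptors(labels, parent_of, corpus_sentences):
    """Build a weighted bag of words (a concept summary) per concept.

    Assumption: a sentence is relevant if it contains the concept label,
    and counts double if it also mentions a parent or child label.
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    descriptors = {}
    for concept, label in labels.items():
        neighbours = list(children.get(concept, []))
        if concept in parent_of:
            neighbours.append(parent_of[concept])
        neighbour_labels = [labels[n].lower() for n in neighbours]

        bag = Counter(tokenize(label))
        for sentence in corpus_sentences:
            lowered = sentence.lower()
            if label.lower() not in lowered:
                continue
            weight = 2 if any(n in lowered for n in neighbour_labels) else 1
            for token in tokenize(sentence):
                bag[token] += weight
        descriptors[concept] = bag
    return descriptors


def recommend(document, descriptors, top_k=3):
    """Rank concepts by TF-IDF cosine similarity between document and descriptor."""
    n = len(descriptors)
    df = Counter()
    for bag in descriptors.values():
        df.update(bag.keys())

    def weigh(bag):
        # Terms that occur in no descriptor cannot contribute and are dropped.
        return {t: c * math.log(1 + n / df[t]) for t, c in bag.items() if t in df}

    def norm(vec):
        return math.sqrt(sum(v * v for v in vec.values()))

    doc_vec = weigh(Counter(tokenize(document)))
    doc_norm = norm(doc_vec)
    scores = []
    for concept, bag in descriptors.items():
        vec = weigh(bag)
        dot = sum(w * vec.get(t, 0.0) for t, w in doc_vec.items())
        denom = doc_norm * norm(vec)
        scores.append((concept, dot / denom if denom else 0.0))
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_k]


# Hypothetical toy thesaurus: "disease" is the parent of the other two concepts.
labels = {"c1": "disease", "c2": "diabetes", "c3": "asthma"}
parent_of = {"c2": "c1", "c3": "c1"}
corpus = [
    "Diabetes is a disease that affects how the body regulates blood glucose.",
    "Insulin therapy is commonly used to manage diabetes.",
    "Asthma is a disease that causes the airways to narrow and swell.",
    "Inhalers deliver medication that relieves asthma symptoms.",
]
document = "The patient adjusted the insulin dose after a high blood glucose reading."

descriptors = build_descriptors(labels, parent_of, corpus)
ranking = recommend(document, descriptors)
for concept, score in ranking:
    print(labels[concept], round(score, 3))
```

The toy document never mentions the label "diabetes", yet the corpus-derived descriptor supplies terms such as "insulin" and "glucose" that a label-only match against the thesaurus would miss. This is the gap that generated concept summaries are meant to close.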
