Guideline Design of an Active Gene Annotation Corpus for the Purpose of Drug Repurposing

To provide the Biomedical Natural Language Processing community with a gold-standard corpus for knowledge discovery in drug repurposing, an active gene annotation corpus (AGAC) was developed in this research. Five semantic trigger labels and three root regulatory trigger labels were designed as molecular- and cell-level biological entity annotations, focusing on the changes of function in biological processes that result from mutated genes. In addition, the predicates ‘ThemeOf’ and ‘CauseOf’ were annotated manually for semantic knowledge extraction. Finally, the roles of gene mutations, including gain of function (GOF) and loss of function (LOF), were curated through the AGAC annotation guideline. The information from AGAC annotation effectively bridges the associations among mutation, gene, drug and disease, and makes it possible to predict new drug directions on a large scale. AGAC corpus availability: The corpus is available on the PubAnnotation platform (http://pubannotation.org/projects/HZAU_Active_Gene_Corpus).