Infrastructure for Semantic Annotation in the Genomics Domain

We describe a novel super-infrastructure for biomedical text mining. It incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature with an NLP pipeline scheduler (Buster) that collects and processes the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, where it is interoperable with the other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.
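
To make the idea of sub-corpus comparison concrete, the sketch below ranks semantic tags by how strongly their frequency differs between two sub-corpora. It uses the log-likelihood keyness statistic, a common choice in corpus linguistics; the abstract does not say which statistics the LexiDB front-end offers, so this choice is an assumption. The function names and the tag labels (`TAG_A`, `TAG_B`) are hypothetical and are not part of GOST, Buster or LexiDB.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one item.

    a, b: frequency of the item in sub-corpus 1 and sub-corpus 2
    c, d: total number of items in sub-corpus 1 and sub-corpus 2
    """
    # Expected frequencies if the item were spread evenly across both sub-corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def compare_subcorpora(tags_1, tags_2):
    """Rank tags by keyness, given two lists of tags (one entry per tagged token)."""
    freq_1, freq_2 = Counter(tags_1), Counter(tags_2)
    n_1, n_2 = len(tags_1), len(tags_2)
    scores = {
        tag: log_likelihood(freq_1[tag], freq_2[tag], n_1, n_2)
        for tag in set(freq_1) | set(freq_2)
    }
    # Highest score first: the tags whose distribution differs most.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical tag streams standing in for two annotated sub-corpora.
    sub_1 = ["TAG_A"] * 3 + ["TAG_B"]
    sub_2 = ["TAG_B"] * 4
    for tag, score in compare_subcorpora(sub_1, sub_2):
        print(tag, round(score, 2))
```

A tag with the same relative frequency in both sub-corpora scores zero, and larger scores indicate a stronger difference. The score alone does not say which sub-corpus overuses the tag, so the relative frequencies need to be compared as well.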
