BioLexicon: A Lexical Resource for the Biology Domain

Natural language processing technologies have advanced remarkably over the past two decades. However, biological terminology remains a frequent cause of analysis errors when processing literature in the biology domain. The BOOTStrep BioLexicon is a linguistic resource tailored to the domain to address these problems. It contains the following types of entries: (1) terminological verbs; (2) derived forms of the terminological verbs; (3) general English words frequently used in the biology domain; and (4) domain terms. This comprehensive coverage of biological terms makes the lexicon a unique linguistic resource within the domain. This paper focuses on the linguistic aspects of the lexicon.
