SimSem: Fast Approximate String Matching in Relation to Semantic Category Disambiguation

In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes at: http://github.com/ninjin/simsem
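To make the feature idea concrete, the following is a minimal sketch of how a textual span could be mapped to one similarity value per lexical resource. It is not the released SimSem implementation. It assumes cosine similarity over character trigrams as the string measure and uses a brute-force scan in place of the fast approximate matching algorithm the abstract refers to. The names `char_ngrams`, `cosine`, `closest_match_features`, and the toy `resources` dictionary are hypothetical and chosen only for illustration.

```python
from math import sqrt


def char_ngrams(text, n=3):
    """Return the set of padded, lower-cased character n-grams of a string."""
    padded = "$" * (n - 1) + text.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(a, b):
    """Cosine similarity between two n-gram sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def closest_match_features(span, resources, n=3):
    """Map a span to {resource name: similarity to its closest entry}.

    Assumption: a brute-force scan over every entry; a real system would
    query an index built for approximate matching instead.
    """
    span_grams = char_ngrams(span, n)
    features = {}
    for name, entries in resources.items():
        features[name] = max(
            (cosine(span_grams, char_ngrams(entry, n)) for entry in entries),
            default=0.0,
        )
    return features


# Toy resources (hypothetical); the study combines far larger collections.
resources = {
    "proteins": ["interleukin-2", "NF-kappa B"],
    "species": ["Homo sapiens", "Mus musculus"],
}

# A spelling variant still scores close to its dictionary form.
features = closest_match_features("NF-kappaB", resources)
```

Each value in `features` could then be passed to a classifier as one real-valued feature per resource, so that a near match in a protein dictionary and no match in a species dictionary both contribute evidence for the semantic category.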
