Filling the Gaps Between Tools and Users: A Tool Comparator, Using Protein-Protein Interactions as an Example

Recently, several text mining programs have reached a near-practical level of performance, and some systems are already being used by biologists and database curators. However, it has also been recognized that current Natural Language Processing (NLP) and Text Mining (TM) technology is not easy to deploy, since research groups tend to develop systems that cater specifically to their own requirements. One of the major reasons for this difficulty is that the re-usability and interoperability of software tools are typically not considered during development. While some effort has been invested in building interoperable NLP/TM toolkits, developers of end-to-end systems still struggle to reuse NLP/TM tools and often opt to develop similar programs from scratch instead. This is particularly the case in BioNLP, since the requirements of biologists are so diverse that NLP tools have to be adapted and re-organized much more extensively than was originally expected. Although generic frameworks like UIMA (Unstructured Information Management Architecture) provide promising ways to solve this problem, the solution that they provide is only partial. For truly interoperable toolkits to become a reality, we also need sharable type systems and a developer-friendly environment for software integration that includes functionality for systematic comparison of available tools, a simple I/O interface, and visualization tools. In this paper, we describe such an environment, developed on top of UIMA, and we show its feasibility through our experience in developing a protein-protein interaction (PPI) extraction system.
