Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP

We introduce JCORE, a full-fledged UIMA-compliant component repository for complex text analytics developed at the Jena University Language & Information Engineering (JULIE) Lab. JCORE is based on a comprehensive type system and a variety of document readers, analysis engines, and CAS consumers. We survey these components and then turn to a discussion of lessons we learnt, with particular emphasis on managing the underlying type system. We briefly sketch two complex NLP applications which can easily be built from the components contained in JCORE.
