A Data-Centric Framework for Composable NLP Workflows

Empirical natural language processing (NLP) systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion, human annotation, to text retrieval, analysis, generation, and visualization. We establish a unified open-source framework to support fast development of such sophisticated NLP workflows in a composable manner. The framework introduces a uniform data representation to encode heterogeneous results by a wide range of NLP tasks. It offers a large repository of processors for NLP tasks, visualization, and annotation, which can be easily assembled with full interoperability under the unified representation. The highly extensible framework allows plugging in custom processors from external off-the-shelf NLP and deep learning libraries. The whole framework is delivered through two modularized yet integratable open-source projects, namely Forte (for workflow infrastructure and NLP function processors) and Stave (for user interaction, visualization, and annotation).

[1]  Iryna Gurevych,et al.  A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures , 2016, LT4DH@COLING.

[2]  Will Styler,et al.  Anafora: A Web-based General Purpose Annotation Tool , 2013, HLT-NAACL.

[3]  R'emi Louf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[4]  Luke S. Zettlemoyer,et al.  End-to-end Neural Coreference Resolution , 2017, EMNLP.

[5]  Philip V. Ogren,et al.  Knowtator: A Protégé plug-in for annotated corpus construction , 2006, NAACL.

[6]  Graham Neubig,et al.  Generalizing Natural Language Analysis through Span-relation Representations , 2020, ACL.

[7]  Kyunghyun Cho,et al.  Passage Re-ranking with BERT , 2019, ArXiv.

[8]  Jie Yang,et al.  YEDDA: A Lightweight Collaborative Text Span Annotation Tool , 2017, ACL.

[9]  Eric P. Xing,et al.  Knowledge-driven Encode, Retrieve, Paraphrase for Medical Image Report Generation , 2019, AAAI.

[10]  Anna Rumshisky,et al.  CliNER : A Lightweight Tool for Clinical Named Entity Recognition , 2015 .

[11]  Sampo Pyysalo,et al.  brat: a Web-based Tool for NLP-Assisted Text Annotation , 2012, EACL.

[12]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[13]  Stéphane M. Meystre,et al.  Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes , 2015, J. Biomed. Informatics.

[14]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[15]  Dan Roth,et al.  An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) , 2012, LREC.

[16]  Iryna Gurevych,et al.  A broad-coverage collection of portable NLP components for building shareable analysis pipelines , 2014, OIAF4HLT@COLING.

[17]  Steven Bethard,et al.  ClearTK 2.0: Design Patterns for Machine Learning in UIMA , 2014, LREC.

[18]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[19]  Omer Levy,et al.  Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling , 2018, ACL.

[20]  Eric Gilbert,et al.  VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text , 2014, ICWSM.

[21]  Jeff Johnson,et al.  Billion-Scale Similarity Search with GPUs , 2017, IEEE Transactions on Big Data.

[22]  Eric P. Xing,et al.  Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation , 2018, ACL.

[23]  Kalina Bontcheva,et al.  Getting More Out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics , 2013, PLoS Comput. Biol..

[24]  Luke S. Zettlemoyer,et al.  AllenNLP: A Deep Semantic Natural Language Processing Platform , 2018, ArXiv.

[25]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[26]  Thilo Götz,et al.  Design and implementation of the UIMA Common Analysis System , 2004, IBM Syst. J..