Evaluating and Integrating Databases in the Area of NLP

As computational power increases rapidly, the analysis of big data is becoming more popular. One example is word embeddings, which produce huge index files of interrelated items. Another is digital editions of corpora, which represent data on nested levels of text structuring. A third is annotations of multimodal communication, which comprise nested and networked data of various (e.g., gestural or linguistic) modes. The first example relates to graph-based models, the second requires document models in the tradition of TEI, and the third combines both. A central question is how to store and process such big and diverse data so as to support NLP and related routines efficiently. In this paper, we evaluate six database management systems (DBMSs) as candidates for answering this question. We do so by examining database operations in the context of six NLP routines. We show that no single DBMS consistently works best. Rather, a family of DBMSs manifesting different database paradigms is required to meet the need to process big and divergent data. To this end, the paper introduces a web-based multi-database management system (MDBMS) as an interface to varieties of such databases.
