Natural Language Processing: Integration of Automatic and Manual Analysis

There is a current trend to combine natural language analysis with research questions from the humanities. This requires integrating automatic analysis with manual analysis, e.g. to develop a theory behind the analysis, to test the theory against a corpus, to generate training data for automatic analysis based on machine learning algorithms, and to evaluate the quality of the results from automatic analysis. Manual analysis is traditionally the domain of linguists, philosophers, and researchers from other humanities disciplines, who are often not expert programmers. Automatic analysis, on the other hand, is traditionally done by expert programmers, such as computer scientists and, more recently, computational linguists. It is important to bring these communities, their tools, and their data closer together in order to produce analyses of higher quality with less effort. However, promising collaborations involving manual and automatic analysis, e.g. for the purpose of analyzing a large corpus, are hindered by many problems:

- No comprehensive set of interoperable automatic analysis components is available.
- Assembling automatic analysis components into workflows is too complex.
- Automatic analysis tools, exploration tools, and annotation editors are not interoperable.
- Workflows are not portable between computers.
- Workflows are not easily deployable to a compute cluster.
- There are no adequate tools for the selective annotation of large corpora.
- In automatic analysis, annotation type systems are predefined, but manual annotation requires customizability.
- Implementing new interoperable automatic analysis components is too complex.
- Workflows and components are not sufficiently debuggable and refactorable.
- Workflows that change dynamically via parametrization are not readily supported.
- The user has no control over workflows that rely on expert skills from a different domain, undocumented knowledge, or third-party infrastructures, e.g. web services.

In cooperation with researchers from the humanities, we develop innovative technical solutions and designs to facilitate the use of automatic analysis and to promote the integration of manual and automatic analysis. To address these issues, we lay foundations in four areas:

- Usability is improved by reducing the complexity of the APIs for building workflows and creating custom components, improving the handling of resources required by such components, and setting up auto-configuration mechanisms.
- Reproducibility is improved through a concept for self-contained, portable analysis components and workflows, combined with a declarative modeling approach for dynamic parametrized workflows, which helps avoid unnecessary auxiliary manual steps in automatic workflows.
- Flexibility is achieved by providing an extensive collection of interoperable automatic analysis components. We also compare annotation type systems used by different automatic analysis components to locate design patterns that allow for customization when used in manual analysis tasks.
- Interactivity is achieved through a novel "annotation-by-query" process combining corpus search with annotation in a multi-user scenario. The process is supported by a web-based tool.

We demonstrate the adequacy of our concepts through examples that represent whole classes of research problems. Additionally, we have integrated all our concepts into existing open-source projects, or implemented and published them within new open-source projects.
