Building an NLP pipeline within a digital publishing workflow

Outside the laboratory environment, developers of NLP tools have always had to rely on robust techniques to clean and normalise the many formats in which authentic texts occur. In most cases, the cleaned version was simply the bare text, stripped of all typographical information and tokenised in such a way that even reconstructing a simple sentence produced an unsatisfactory layout. To integrate NLP output into the production workflow of digital publications, it is necessary to keep track of the original layout. In this paper, we present an NLP pipeline developed to meet the requirements of real-world digital publishing applications. The pipeline was developed within the iRead+ project, a cooperative research project between several industrial and academic partners in Flanders. It aims to enrich texts automatically with word-specific and contextual information, in order to create an enhanced reading experience on tablets and to support the automatic generation of grammatical exercises. The enriched documents contain both linguistic annotations (part-of-speech tags and lemmata) and semantic annotations based on the recognition and disambiguation of named entities. The whole enrichment process is provided via a web service and can be integrated into an XML-based production flow. The input of the NLP enrichment engine consists of two documents: a well-formed XML source file and a control file containing XPath expressions that describe the nodes in the source file to be annotated and enriched. Because nodes may contain a pre-defined set of mixed data, the original document can be reconstructed with the selected enrichments.
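
The two-input design described above (an XML source plus XPath expressions selecting the nodes to enrich) can be illustrated with a minimal sketch. It is not the iRead+ implementation: the `<w>` wrapper element, the regular-expression tokeniser, the sample document, and the representation of the control file as a plain list of XPath strings are all assumptions made for illustration, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath. The point it shows is that tokens can be wrapped in place while whitespace and inline markup are kept, so that the original text remains recoverable.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tokeniser: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def wrap_tokens(text):
    """Return (leading_text, [<w> elements]) for a text fragment.

    Whitespace between tokens is stored in the tail of the preceding <w>,
    so concatenating all text pieces reproduces the fragment exactly.
    """
    elements, leading, pos = [], "", 0
    for match in TOKEN_RE.finditer(text):
        gap = text[pos:match.start()]
        if elements:
            elements[-1].tail = gap
        else:
            leading = gap
        w = ET.Element("w")  # a real pipeline would attach POS/lemma here
        w.text = match.group()
        elements.append(w)
        pos = match.end()
    if elements:
        elements[-1].tail = text[pos:]
    else:
        leading = text
    return leading, elements


def enrich_node(node):
    """Wrap every token below `node` in <w>, keeping inline children in place."""
    if node.tag == "w":  # already a token wrapper, do not wrap twice
        return
    children = list(node)
    leading, new_children = wrap_tokens(node.text or "")
    node.text = leading or None
    for child in children:
        enrich_node(child)  # handle mixed content recursively
        tail_lead, tail_tokens = wrap_tokens(child.tail or "")
        child.tail = tail_lead
        new_children.append(child)
        new_children.extend(tail_tokens)
    for child in children:
        node.remove(child)
    node.extend(new_children)


def enrich(source_xml, xpaths):
    """Enrich only the nodes selected by the control-file XPath expressions."""
    root = ET.fromstring(source_xml)
    for expr in xpaths:
        for node in root.findall(expr):
            enrich_node(node)
    return root


# Assumed sample inputs (not taken from the paper).
SOURCE = (
    "<article><title>Reading in Ghent</title>"
    "<p>Els reads <em>two</em> books, daily.</p>"
    "<note>skip me</note></article>"
)
CONTROL = [".//title", ".//p"]

enriched = enrich(SOURCE, CONTROL)
print(ET.tostring(enriched, encoding="unicode"))
```

Because inter-token whitespace is stored rather than discarded, removing the `<w>` wrappers gives back the original text, and nodes not named in the control list (here `<note>`) pass through unchanged.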
