The Tanl Pipeline

Tanl (Natural Language Text Analytics) is a suite of tools for text analytics based on the software architecture paradigm of data pipelines. Tanl pipelines are data driven, i.e. each stage pulls data from the preceding stage and transforms it for use by the next stage. Since data is processed as soon as it becomes available, processing delay is minimized, improving data throughput. The processing modules can be written in C++ or in Python and can be combined with a few lines of Python script to produce full NLP applications. Tanl provides a set of modules ranging from tokenization to POS tagging, from parsing to NE recognition. A Tanl pipeline can be processed in parallel on a cluster of computers by means of a modified version of Hadoop streaming. We present the architecture, its modules and some sample applications.

Introduction

Text analytics involves many tasks, ranging from simple text collection, extraction and preparation to linguistic, syntactic and semantic analysis, cross-reference analysis, intent mining and finally indexing and search. A complete system must be able to process textual data of any size and structure, to extract words, to classify documents into categories (taxonomies or ontologies), and to identify semantic relationships. A full analytics application requires coordinating and combining several tools designed to handle specific subtasks. This may be challenging, since many of the existing tools have been developed independently, with different requirements and assumptions on how to process the data.

Several suites for NLP (Natural Language Processing) are available for performing syntactic and semantic data analysis, some as open source and others as commercial products. These toolsets can be grouped into two broad software architecture categories:

Integrated Toolkits: these provide a set of classes and methods for each task, and are typically bound to a programming language. Applications are programmed using compilers and standard programming environments. Examples in this category are: LingPipe (LingPipe), OpenNLP (OpenNLP), NLTK (NLTK).

Component Frameworks: these use generic data structures, described in a language-independent formalism, and each tool consumes/produces such data; a special compiler transforms the data descriptions into types for the target programming language. Applications are built using specific framework tools. Examples in this category are: GATE (GATE), UIMA (UIMA).

Both GATE and UIMA are based on a workflow software architecture, where the framework handles the workflow among the processing stages of the application by means of a controller that passes data among the components, invoking their methods. Each tool accepts and returns the same type of data and extends the data it receives by adding its own information, as shown using different colors in Figure 1: the Tokenizer adds annotations representing the start and end of each token, and the PosTagger adds annotations representing the POS of each token. Since the controller handles the whole processing in a single flow, each processing component receives the whole collection and returns the whole collection. If the collection is big, this might require large amounts of memory; a minimal sketch of this whole-collection flow is given below.

Figure 1: Workflow Software Architecture.

In this paper we present an alternative architecture based on the notion of data pipeline.
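To make the memory argument concrete, the following is a minimal, illustrative sketch of workflow-style processing, written here in Python; it is not GATE or UIMA code, and the component names and dictionary-based annotations are invented for illustration. Each component receives the whole annotated collection and returns it, extended with its own annotations, so the entire collection stays in memory throughout.

    # Illustrative sketch of a workflow-style controller (not GATE/UIMA code):
    # every component receives the whole collection and returns it, extended
    # with its own annotations.

    def tokenizer(collection):
        for doc in collection:
            doc["tokens"] = doc["text"].split()         # add token annotations
        return collection

    def pos_tagger(collection):
        for doc in collection:
            doc["pos"] = ["NOUN"] * len(doc["tokens"])  # dummy POS annotations
        return collection

    def controller(collection, components):
        # The controller drives the whole flow: each stage sees the full collection.
        for component in components:
            collection = component(collection)
        return collection

    docs = [{"text": "Tanl is a suite of tools for text analytics ."}]
    annotated = controller(docs, [tokenizer, pos_tagger])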
The Tanl pipeline (Natural Language Text Analytics) uses both generic and specific data structures, and components communicate directly, exchanging data through pipes, as shown in Figure 2. Since each tool pulls the data it needs from the previous stage of the pipeline, only the minimum amount of data passes through the pipeline, thereby reducing the memory footprint and improving the throughput. The figure shows single documents being passed along, but the granularity can be even smaller: for instance, a module might just require single tokens or single sentences. This would be hard to handle with a workflow architecture, since the controller does not know how much data each component needs in order to carry out its processing.
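By contrast, a pull-based pipeline maps naturally onto Python generators: each stage is an iterator that pulls items from the previous stage on demand, so only one sentence (or token) is in flight at a time. The sketch below illustrates the idea only; it is not the actual Tanl module interface, and the stage names, input file and dummy POS tag are invented for illustration.

    # Illustrative pull-based pipeline built from Python generators
    # (not the actual Tanl module interface).

    def read_sentences(path):
        # Source stage: yields one sentence at a time, never the whole collection.
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

    def tokenize(sentences):
        # Pulls one sentence from the previous stage and yields its token list.
        for sentence in sentences:
            yield sentence.split()

    def pos_tag(token_lists):
        # Pulls one token list and yields (token, tag) pairs; the tag is a dummy.
        for tokens in token_lists:
            yield [(token, "NOUN") for token in tokens]

    # A full pipeline is composed in a few lines: data flows only when the last
    # stage pulls it, which keeps the memory footprint small.
    pipeline = pos_tag(tokenize(read_sentences("input.txt")))
    for tagged_sentence in pipeline:
        print(tagged_sentence)

This composition style is what allows a full application to be assembled with a few lines of Python script, as noted above.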