Ogmios: a scalable NLP platform for annotating large web document collections

NLP tools are now widely available, but using them remains problematic: their input/output formats are not homogeneous, the granularity of the information they produce varies, they are difficult to tune to a specific domain, and large amounts of heterogeneous documents are hard to process in a reasonable time. To address these problems, we propose a configurable platform for enriching very large collections of specialised French and English documents. The platform is a modular framework. Each module carries out one annotation step (named entity recognition, sentence and word segmentation, lemmatisation, POS tagging, term tagging or parsing) by using existing NLP tools, and can be tuned to a domain by adding specific resources. Linguistic annotations are recorded in a stand-off XML format. We focus on the robustness of the annotation process, to support the creation of annotated corpora from the web. We tested the scalability of the platform on two collections: 55,329 biomedical web documents (107 million words) and 47,393 Search Engine News documents (13 million words). Both collections were annotated up to the term tagging step, in 35 hours and 3 hours respectively.
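To make the idea of a modular pipeline with stand-off annotations concrete, here is a minimal Python sketch. It is not the Ogmios implementation: the names `Annotation`, `Document`, `segment_sentences`, `segment_words` and `run_pipeline` are hypothetical, the regex-based segmenters stand in for the external NLP tools a real module would call, and the annotations are kept in memory rather than serialised to XML.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A stand-off annotation: it points into the text instead of modifying it."""
    layer: str  # e.g. "sentence" or "word" (hypothetical layer names)
    start: int  # character offset of the first character
    end: int    # character offset just past the last character


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def spans(self, layer: str) -> List[str]:
        """Return the text covered by every annotation of one layer."""
        return [self.text[a.start:a.end]
                for a in self.annotations if a.layer == layer]


def segment_sentences(doc: Document) -> None:
    """Toy sentence segmenter: split on '.', '!' or '?' (assumption, not a real tool)."""
    for m in re.finditer(r"\S[^.!?]*[.!?]?", doc.text):
        doc.annotations.append(Annotation("sentence", m.start(), m.end()))


def segment_words(doc: Document) -> None:
    """Toy word segmenter that builds on the sentence layer added earlier."""
    sentences = [a for a in doc.annotations if a.layer == "sentence"]
    for s in sentences:
        for m in re.finditer(r"\w+", doc.text[s.start:s.end]):
            # Offsets are shifted back so they refer to the whole document.
            doc.annotations.append(
                Annotation("word", s.start + m.start(), s.start + m.end()))


def run_pipeline(doc: Document,
                 modules: List[Callable[[Document], None]]) -> Document:
    """Apply each annotation module in order; each one only adds annotations."""
    for module in modules:
        module(doc)
    return doc


doc = run_pipeline(Document("NLP tools vary. Formats differ!"),
                   [segment_sentences, segment_words])
print(doc.spans("sentence"))
print(doc.spans("word"))
```

Because every module only appends offsets and never rewrites `doc.text`, a step can be replaced by another tool, or tuned to a domain, without invalidating the annotations produced by the other steps.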