Challenges for Information Extraction in the Oil and Gas Domain

Governments, corporations, and scientific organizations increasingly need to extract complex information from highly technical documents. Linguistic resources exist for some technical domains, but they are largely unavailable for the oil and gas domain. We applied natural language processing tools with minimal domain adaptation to extract information from 155 annotated text passages taken from geological reports. In recognizing oil field entity names, we achieved a precision of .94 and a recall of .43 (F1 = .59) without supervised learning. We describe the impact of the errors found in the output, including errors in sentence segmentation, part-of-speech tagging, multiword expression identification, word sense disambiguation, and the handling of numeric quantities, along with other issues that lead to incorrect entity classifications. Most of these errors could be avoided with a domain-specific dictionary that includes part-of-speech tags.
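As a quick check on the reported scores, the F1 value is the harmonic mean of precision and recall. The sketch below recomputes it from the two figures given in the abstract; the function name `f1_score` is a hypothetical helper chosen for illustration, not something taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for oil field entity names.
reported_f1 = f1_score(0.94, 0.43)
print(f"{reported_f1:.2f}")  # prints 0.59
```

The harmonic mean is pulled toward the lower of the two values, which is why a high precision of .94 combined with a recall of .43 still yields an F1 of only .59.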
