Paper information - Challenges for Information Extraction in the Oil and Gas Domain

Challenges for Information Extraction in the Oil and Gas Domain

Increasingly, governments, corporations, and scientific organizations need to extract complex information from highly technical documents. While linguistic resources exist in some technical domains, they are largely unavailable for the oil and gas domain. We applied natural language processing tools with minimum domain adaptation to extract information from 155 annotated text passages from geological reports. In recognizing oil field entity names, we achieved a precision of .94 and recall of .43 (F1=.59) without supervised learning. We describe the impact of errors found in the output, including incorrect segmentation, part-of-speech tags, multiword expressions, word sense disambiguation, numeric quantities, and other issues leading to incorrect entity classifications. These mistakes could be reduced with a domain-specific dictionary that includes part-of-speech tags.

Resumo. Cada vez mais governos, corporações e instituições científicas precisam extrair informações complexas de documentos técnicos. Enquanto recursos linguísticos existem em alguns domínios técnicos, estes estão em grande parte indisponível para o domínio de petróleo e gás. Nós aplicamos ferramentas de processamento de texto com mínima adaptação ao domínio para extrair informações de 155 passagens de texto de relatórios geológicos anotados. Ao reconhecer os nomes das entidades dos campos de petróleo, alcançamos uma precisão de .94 e um recall de .43 (F1 = .59) sem aprendizagem supervisionada. Nós descrevemos o impacto dos erros de segmentação de sentenças, tagging, identificação de expressões multi-palavra, desambiguação do sentido das palavras, e outras questões, na classificação incorreta das entidades. Os erros encontrados poderiam, em sua maioria, serem evitados com um dicionário específico de domínio.
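The reported F1 can be checked against the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of the two; the abstract does not state which formula was used, and the function name `f1_score` is chosen here only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Scores reported in the abstract for oil field entity names.
precision, recall = 0.94, 0.43

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.59"
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall (.43) dominates the result even though precision is high (.94).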

Alexandre Rademaker
