Leveraging Pattern Semantics for Extracting Entities in Enterprises

Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging on enterprise domains due to several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE not effective, since using only high-precision/low-recall patterns like those systems would miss the majority of sparse enterprise entities, while using more low-precision patterns in sparse setting also introduces noise drastically. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), such that public signals from the web cannot be directly applied on entities. Moreover, many internal entities never appear on the web. Sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking the input of enterprise corpus and limited seeds to generate a high-quality entity collection as output. We introduce the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.

[1]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[2]  M. Chial,et al.  in simple , 2003 .

[3]  Doug Downey,et al.  Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison , 2004, AAAI.

[4]  Ellen Riloff,et al.  Automatically Generating Extraction Patterns from Untagged Text , 1996, AAAI/IAAI, Vol. 2.

[5]  Oren Etzioni,et al.  Identifying Relations for Open Information Extraction , 2011, EMNLP.

[6]  Pedro M. Domingos,et al.  Unsupervised Ontology Induction from Text , 2010, ACL.

[7]  Oren Etzioni,et al.  Open Language Learning for Information Extraction , 2012, EMNLP.

[8]  Alessandro Lenci,et al.  Distributional semantics in linguistic and cognitive research , 2008 .

[9]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[10]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[11]  Christopher D. Manning,et al.  SPIED: Stanford Pattern based Information Extraction and Diagnostics , 2014 .

[12]  William W. Cohen,et al.  Iterative Set Expansion of Named Entities Using the Web , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[13]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[14]  Dayne Freitag,et al.  Toward General-Purpose Learning for Information Extraction , 1998, ACL.

[15]  Estevam R. Hruschka,et al.  Coupled semi-supervised learning for information extraction , 2010, WSDM '10.

[16]  Christopher D. Manning,et al.  Improved Pattern Learning for Bootstrapped Entity Extraction , 2014, CoNLL.

[17]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.

[18]  William W. Cohen,et al.  Language-Independent Set Expansion of Named Entities Using the Web , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[19]  Oren Etzioni,et al.  Open Information Extraction: The Second Generation , 2011, IJCAI.

[20]  Doug Downey,et al.  Learning text patterns for web information extraction and assessment , 2004, AAAI 2004.

[21]  Yoram Singer,et al.  A simple, fast, and effective rule learner , 1999, AAAI 1999.

[22]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[23]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[24]  Ellen Riloff,et al.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping , 1999, AAAI/IAAI.

[25]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[26]  Ellen Riloff,et al.  A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts , 2002, EMNLP.

[27]  Luis Gravano,et al.  Snowball: extracting relations from large plain-text collections , 2000, DL '00.

[28]  Ben Taskar,et al.  An Introduction to Conditional Random Fields for Relational Learning , 2007 .

[29]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.