Efficient approximate entity extraction with edit distance constraints

Named entity recognition aims to extract named entities from unstructured text. A recent trend in named entity recognition is to find approximate matches in the text against a large dictionary of known entities, since the domain knowledge encoded in the dictionary helps improve extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies that use token-based similarity constraints, our problem definition captures typographical and orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. The problem is technically challenging because existing approaches based on q-gram filtering perform poorly when the dictionary contains many short entities. Our proposed solution is based on an improved neighborhood generation method that employs novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
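To make the problem setting concrete, the following Python sketch shows a naive baseline for approximate dictionary matching under an edit distance threshold, together with the standard q-gram count-filter bound that explains why short entities are troublesome. It is an illustration only and not the algorithm proposed in the paper; the function names (`edit_distance`, `count_filter_bound`, `approximate_mentions`) and the sample inputs are assumptions introduced here.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def count_filter_bound(len_s: int, len_t: int, q: int, tau: int) -> int:
    """Minimum number of q-grams two strings must share if their edit
    distance is at most tau (the classic count filter). A value <= 0
    means the filter cannot prune anything."""
    return max(len_s, len_t) - q + 1 - q * tau


def approximate_mentions(text: str, dictionary: list[str], tau: int):
    """Naive baseline (hypothetical helper): compare every substring of
    plausible length with every dictionary entity."""
    matches = []
    for entity in dictionary:
        shortest = max(1, len(entity) - tau)
        longest = len(entity) + tau
        for length in range(shortest, longest + 1):
            for start in range(len(text) - length + 1):
                candidate = text[start:start + length]
                if edit_distance(candidate, entity) <= tau:
                    matches.append((start, candidate, entity))
    return matches


if __name__ == "__main__":
    # The misspelling "aspirn" is within edit distance 1 of "aspirin".
    print(approximate_mentions("the aspirn dose", ["aspirin"], tau=1))
    # Short entity: the bound is not positive, so q-gram filtering is vacuous.
    print(count_filter_bound(4, 4, q=2, tau=2))
    # Long entity: many shared q-grams are required, so filtering is effective.
    print(count_filter_bound(20, 20, q=2, tau=2))
```

With q = 2 and a threshold of 2, a 4-character entity yields a bound of -1, whereas a 20-character entity yields 15. The first case gives a q-gram filter nothing to prune on, which is the difficulty with short dictionary entries that the abstract points to. The naive baseline, in turn, compares every candidate substring with every entity, which is the kind of unnecessary comparison an efficient method aims to avoid.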
