AML: Efficient Approximate Membership Localization within a Web-Based Join Framework

In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but it also generates many redundant matches, which lower the efficiency of the AME process and degrade the performance of real-world applications that use the extracted substrings. The AML problem aims to locate nonoverlapped substrings, which are a better approximation of the true matched substrings and carry no overlapped redundancies. To perform AML efficiently, we propose an optimized algorithm, P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study on several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML as applied to a proposed web-based join framework, a search-based approach that joins two tables using dictionary-based entity recognition from web documents. The results not only show the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach.
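
To make the difference between AME and AML concrete, the following is a minimal Python sketch, not the P-Prune algorithm. It assumes token-level Jaccard similarity, a fixed similarity threshold, and a greedy highest-score-first rule for choosing nonoverlapped matches; the function names `ame_matches` and `aml_localize`, the sample dictionary, and the threshold are all hypothetical choices made for illustration. It enumerates every candidate substring first and filters afterwards, which is the kind of redundant generation that pruning is meant to avoid.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, dictionary entry, score)
Match = Tuple[int, int, str, float]


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard similarity (an assumed similarity function)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def ame_matches(tokens: List[str], dictionary: List[str],
                threshold: float, max_len: int) -> List[Match]:
    """AME-style extraction: keep every substring (token window) whose
    similarity to some dictionary entry reaches the threshold."""
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = tokens[start:end]
            for entry in dictionary:
                score = jaccard(window, entry.split())
                if score >= threshold:
                    matches.append((start, end, entry, score))
    return matches


def aml_localize(matches: List[Match]) -> List[Match]:
    """AML-style localization (assumed greedy rule): take matches in
    descending score order and keep one only if it overlaps no match
    already kept."""
    chosen: List[Match] = []
    for m in sorted(matches, key=lambda m: (-m[3], m[0], m[1])):
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


# Hypothetical document and dictionary, for illustration only.
doc = "the canon eos 5d digital camera and the nikon d700 body".split()
dictionary = ["canon eos 5d digital camera", "nikon d700"]

all_matches = ame_matches(doc, dictionary, threshold=0.65, max_len=5)
localized = aml_localize(all_matches)

for start, end, entry, score in localized:
    print(" ".join(doc[start:end]), "->", entry, round(score, 2))
```

In this toy input the AME step returns several overlapping windows around each product name, while the localization step keeps a single nonoverlapped substring per mention.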
