
AML: Efficient Approximate Membership Localization within a Web-Based Join Framework

In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but it also produces many redundant matches, which lower the efficiency of the AME process and degrade the performance of real-world applications that use the extracted substrings. The AML problem aims to locate nonoverlapping substrings, which are a better approximation of the true matched substrings, without generating overlapping redundancies. To perform AML efficiently, we propose an optimized algorithm, P-Prune, that prunes a large part of the overlapping redundant matched substrings before generating them. Our study on several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML in a proposed web-based join framework, a search-based approach that joins two tables using dictionary-based entity recognition from web documents. The results not only show the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach.
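The contrast between AME and AML can be illustrated with a small sketch. It is not the paper's P-Prune algorithm: it generates every candidate first and only then removes overlaps, which is exactly the cost P-Prune is described as avoiding. The Jaccard similarity over tokens, the threshold, the `max_len` window bound, the greedy highest-similarity-first selection, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end_exclusive, dictionary_entry, similarity); hypothetical record layout
Match = Tuple[int, int, str, float]


def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity (an assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ame(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int) -> List[Match]:
    """AME-style extraction: every window similar enough to some entry."""
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = set(tokens[start:end])
            for entry in dictionary:
                sim = jaccard(window, set(entry.split()))
                if sim >= threshold:
                    matches.append((start, end, entry, sim))
    return matches


def aml(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int) -> List[Match]:
    """AML-style localization: keep only nonoverlapping matches.

    Assumption: overlaps are resolved greedily, highest similarity first.
    """
    candidates = sorted(ame(tokens, dictionary, threshold, max_len),
                        key=lambda m: (-m[3], m[0], m[1]))
    chosen: List[Match] = []
    for m in candidates:
        # Two half-open spans are disjoint if one ends before the other starts.
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


tokens = "the canon eos 5d digital camera and sony dsc w80".split()
dictionary = ["canon eos 5d", "sony dsc w80 camera"]

ame_matches = ame(tokens, dictionary, threshold=0.65, max_len=4)
aml_matches = aml(tokens, dictionary, threshold=0.65, max_len=4)

print(len(ame_matches), "AME matches (overlapping)")
for start, end, entry, sim in aml_matches:
    print(" ".join(tokens[start:end]), "->", entry, round(sim, 2))
```

In this toy document, AME reports several overlapping windows around "canon eos 5d", while the AML-style selection keeps one span per mention.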

Xiaoyong Du | Laurianne Sitbon | Xiaofang Zhou | Zhixu Li | Liwei Wang
