Long-tail Vocabulary Dictionary Extraction from the Web

A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method, Lyretail, to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors, which use each page's structural and textual information to detect the long-tail items on that single webpage more accurately. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail outputs a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement in mean average precision for the dictionary generation process and a 30.7% improvement in F1 for page-specific extraction, compared to previous state-of-the-art methods.
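The aggregation step mentioned in the abstract can be pictured with a minimal sketch. The abstract does not say how page-specific dictionaries are combined, so the page-count voting used here is an assumption chosen for illustration, not the paper's method, and the function name `aggregate_dictionaries` is hypothetical.

```python
from collections import Counter


def aggregate_dictionaries(page_dicts):
    """Rank candidate items by the number of pages that extracted them.

    Assumption (not from the paper): each page casts one vote per item,
    and items are compared case-insensitively after trimming whitespace.
    """
    votes = Counter()
    for items in page_dicts:
        # De-duplicate within a page so a page votes at most once per item.
        normalized = {item.strip().lower() for item in items if item.strip()}
        votes.update(normalized)
    # Sort by descending vote count, breaking ties alphabetically.
    return sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    pages = [["Paris", "Lyon", "paris "], ["Lyon", "Annecy"], ["lyon"]]
    for item, count in aggregate_dictionaries(pages):
        print(item, count)
```

Under this assumption, an item that appears on a single page still enters the output, just with a low rank, which is one simple way to keep rare items in the candidate list.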

Zhe Chen | H. V. Jagadish | Michael J. Cafarella
