Long-tail Vocabulary Dictionary Extraction from the Web

A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries from a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop Lyretail, a novel method for constructing high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Lyretail obtains long-tail items by building and executing high-quality webpage-specific extractors. Each extractor uses the structural and textual information of its own webpage, which makes it accurate enough to detect the long-tail items on that single page. These webpage-specific extractors are trained via a co-training procedure on distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail outputs a high-quality, comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, Lyretail achieves a 17.3% improvement in mean average precision for the dictionary generation process and a 30.7% improvement in F1 for page-specific extraction, compared to previous state-of-the-art methods.
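
The final aggregation step can be illustrated with a minimal sketch. The abstract does not say how Lyretail combines page-specific dictionaries, so the scoring rule here (rank each item by the number of pages whose extractor returned it) is an assumption made for illustration, and the function name `aggregate_page_dictionaries` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def aggregate_page_dictionaries(
    page_dicts: Iterable[Set[str]],
) -> List[Tuple[str, int]]:
    """Rank candidate items by how many pages' extractors returned them.

    Assumption: page support is used as the ranking score. This is a
    simple stand-in, not the scoring described in the paper.
    """
    support: Counter = Counter()
    for items in page_dicts:
        # Normalize case and whitespace so surface variants pool their votes.
        # Building a set first means one page contributes at most one vote.
        normalized = {" ".join(item.lower().split()) for item in items}
        support.update(normalized)
    ranked = list(support.items())
    # Sort by descending support, then alphabetically for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    # Hypothetical output of three page-specific extractors.
    pages = [{"Rolex", "Omega"}, {"omega", "Tudor"}, {"Omega ", "Rolex"}]
    for item, count in aggregate_page_dictionaries(pages):
        print(item, count)
```

Under this rule, items seen on only one page still enter the ranked list with a low score, which is one simple way to keep long-tail candidates without discarding them.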
