Graph-based weakly-supervised methods for information extraction & integration

The variety and complexity of potentially-related data resources available for querying—webpages, databases, data warehouses—has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most of the current IE and II methods, which can potentially be applied to the problem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration. Within IE, we focus on the problem of assigning semantic classes to entities. First we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real world applications. We use Adsorption, a graph based label propagation algorithm, to significantly increase recall of an initial high-precision, low-recall pattern-based extractor by combining evidences from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we also show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases. Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. We also demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema which has been necessary for most previous systems.

[1]  Dan Suciu,et al.  Schema mediation in peer data management systems , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[2]  Katherine A. Heller,et al.  Bayesian Sets , 2005, NIPS.

[3]  Scott Miller,et al.  Name Tagging with Word Clusters and Discriminative Training , 2004, NAACL.

[4]  Vagelis Hristidis,et al.  ObjectRank: Authority-Based Keyword Search in Databases , 2004, VLDB.

[5]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[6]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[7]  Benjamin Van Durme,et al.  What You Seek Is What You Get: Extraction of Class Attributes from Query Logs , 2007, IJCAI.

[8]  Estevam R. Hruschka,et al.  Coupled semi-supervised learning for information extraction , 2010, WSDM '10.

[9]  David Maier,et al.  From databases to dataspaces: a new abstraction for information management , 2005, SGMD.

[10]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[11]  Ralph Grishman,et al.  Bootstrapped Learning of Semantic Classes from Positive and Negative Examples , 2003 .

[12]  Michael J. Carey,et al.  XPERANTO: Publishing Object-Relational Data as XML , 2000, WebDB.

[13]  Jayant Madhavan,et al.  Google's Deep Web crawl , 2008, Proc. VLDB Endow..

[14]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[15]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[16]  Gerhard Weikum,et al.  NAGA: Searching and Ranking Knowledge , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[17]  Vijay V. Raghavan,et al.  A critical investigation of recall and precision as measures of retrieval system performance , 1989, TOIS.

[18]  Pawel Winter,et al.  Steiner problem in networks: A survey , 1987, Networks.

[19]  Victor J. Katz,et al.  The History of Stokes' Theorem , 1979 .

[20]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[21]  Craig A. Knoblock,et al.  Building Mashups by example , 2008, IUI '08.

[22]  Susan B. Davidson,et al.  BioGuideSRS: querying multiple sources with a user-centric perspective , 2007, Bioinform..

[23]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[24]  Partha Pratim Talukdar,et al.  Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition , 2010, ACL.

[25]  Alon Y. Halevy,et al.  An adaptive query execution system for data integration , 1999, SIGMOD '99.

[26]  Percy Liang,et al.  Semi-Supervised Learning for Natural Language , 2005 .

[27]  Patrick Pantel,et al.  Entity Extraction via Ensemble Semantics , 2009, EMNLP.

[28]  Tong Zhang,et al.  A Robust Risk Minimization based Named Entity Recognition System , 2003, CoNLL.

[29]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[30]  Val Tannen,et al.  Provenance semirings , 2007, PODS.

[31]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[32]  Alon Y. Halevy,et al.  Data integration with uncertainty , 2007, The VLDB Journal.

[33]  G. Nemhauser,et al.  Integer Programming , 2020 .

[34]  Kevin Chen-Chuan Chang,et al.  Understanding Web query interfaces: best-effort parsing with hidden syntax , 2004, SIGMOD '04.

[35]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[36]  Koby Crammer,et al.  Learning to create data-integrating queries , 2008, Proc. VLDB Endow..

[37]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[38]  Fernando Pereira,et al.  Lightly-Supervised Attribute Extraction , 2007 .

[39]  E. Lawler A PROCEDURE FOR COMPUTING THE K BEST SOLUTIONS TO DISCRETE OPTIMIZATION PROBLEMS AND ITS APPLICATION TO THE SHORTEST PATH PROBLEM , 1972 .

[40]  S. Sudarshan,et al.  Bidirectional Expansion For Keyword Search on Graph Databases , 2005, VLDB.

[41]  Yehoshua Sagiv,et al.  Finding and approximating top-k answers in keyword proximity search , 2006, PODS '06.

[42]  Albert,et al.  Emergence of scaling in random networks , 1999, Science.

[43]  The Plasmodium genome database Designing and mining a eukaryotic genomics resource , .

[44]  Pawel Winter,et al.  Path-distance heuristics for the Steiner problem in undirected networks , 1992, Algorithmica.

[45]  Luis Gravano,et al.  Snowball: extracting relations from large plain-text collections , 2000, DL '00.

[46]  Feng Shao,et al.  XRANK: ranked keyword search over XML documents , 2003, SIGMOD '03.

[47]  William W. Cohen,et al.  Language-Independent Set Expansion of Named Entities Using the Web , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[48]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[49]  Tommi S. Jaakkola,et al.  Partially labeled classification with Markov random walks , 2001, NIPS.

[50]  Pawan Kumar,et al.  Notice of Violation of IEEE Publication Principles The Anatomy of a Large-Scale Hyper Textual Web Search Engine , 2009 .

[51]  John Blitzer,et al.  Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification , 2007, ACL.

[52]  Erhard Rahm,et al.  Matching large schemas: Approaches and evaluation , 2007, Inf. Syst..

[53]  Koby Crammer,et al.  Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction , 2007, PLoS Comput. Biol..

[54]  Alon Y. Halevy,et al.  Pay-as-you-go user feedback for dataspace systems , 2008, SIGMOD Conference.

[55]  Luis Gravano,et al.  Evaluating top-k queries over web-accessible databases , 2004, TODS.

[56]  Sunita Sarawagi,et al.  Information Extraction , 2008 .

[57]  Gerhard Weikum,et al.  Fine-grained relevance feedback for XML retrieval , 2008, SIGIR '08.

[58]  Louiqa Raschid,et al.  Explaining and Reformulating Authority Flow Queries , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[59]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[60]  Philip S. Yu,et al.  BLINKS: ranked keyword searches on graphs , 2007, SIGMOD '07.

[61]  Koby Crammer,et al.  New Regularized Algorithms for Transductive Learning , 2009, ECML/PKDD.

[62]  J. Y. Yen Finding the K Shortest Loopless Paths in a Network , 1971 .

[63]  S. Sudarshan,et al.  Keyword searching and browsing in databases using BANKS , 2002, Proceedings 18th International Conference on Data Engineering.

[64]  Andrew McCallum,et al.  Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment , 2009, EMNLP.

[65]  William W. Cohen,et al.  Learning to rank typed graph walks: local and global approaches , 2007, WebKDD/SNA-KDD '07.

[66]  Vagelis Hristidis,et al.  DISCOVER: Keyword Search in Relational Databases , 2002, VLDB.

[67]  Erhard Rahm,et al.  Generic Schema Matching with Cupid , 2001, VLDB.

[68]  Partha Pratim Talukdar,et al.  Automatically incorporating new sources in keyword search-based data integration , 2010, SIGMOD Conference.

[69]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.

[70]  Jirí Matousek,et al.  Low-Distortion Embeddings of Finite Metric Spaces , 2004, Handbook of Discrete and Computational Geometry, 2nd Ed..

[71]  Aron Culotta,et al.  Dependency Tree Kernels for Relation Extraction , 2004, ACL.

[72]  William W. Cohen Integration of heterogeneous databases without common domains using queries based on textual similarity , 1998, SIGMOD '98.

[73]  Thorsten Brants,et al.  A Context Pattern Induction Method for Named Entity Extraction , 2006, CoNLL.

[74]  Jayavel Shanmugasundaram,et al.  Context-Sensitive Keyword Search and Ranking for XML , 2005, WebDB.

[75]  Eytan Ruppin,et al.  Unsupervised learning of natural languages , 2006 .

[76]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[77]  Tong Zhang,et al.  A High-Performance Semi-Supervised Learning Method for Text Chunking , 2005, ACL.

[78]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[79]  Tong Zhang,et al.  Named Entity Recognition through Classifier Combination , 2003, CoNLL.

[80]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[81]  Jeff A. Bilmes,et al.  Soft-Supervised Learning for Text Classification , 2008, EMNLP.

[82]  Vagelis Hristidis,et al.  Keyword proximity search on XML graphs , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[83]  Thorsten Joachims,et al.  Transductive Learning via Spectral Graph Partitioning , 2003, ICML.

[84]  Jie Zhao,et al.  Schema Mediation in Peer Data Management Systems , 2011, Int. J. Cooperative Inf. Syst..

[85]  Marius Pasca,et al.  Bootstrapped extraction of class attributes , 2009, WWW '09.

[86]  Raghu Ramakrishnan,et al.  Toward best-effort information extraction , 2008, SIGMOD Conference.

[87]  Kevin Chen-Chuan Chang,et al.  RankSQL: query algebra and optimization for relational top-k queries , 2005, SIGMOD '05.

[88]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[89]  Gerhard Weikum,et al.  STAR: Steiner-Tree Approximation in Relationship Graphs , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[90]  Jeffrey F. Naughton,et al.  Efficiently incorporating user feedback into information extraction and integration programs , 2009, SIGMOD Conference.

[91]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[92]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[93]  Fernando Pereira,et al.  Online Learning of Approximate Dependency Parsing Algorithms , 2006, EACL.

[94]  Razvan C. Bunescu,et al.  A Shortest Path Dependency Kernel for Relation Extraction , 2005, HLT.

[95]  Oren Etzioni,et al.  Towards a theory of natural language interfaces to databases , 2003, IUI.

[96]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[97]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[98]  Eric Crestan,et al.  Web-Scale Distributional Similarity and Entity Set Expansion , 2009, EMNLP.

[99]  Val Tannen,et al.  Update Exchange with Mappings and Provenance , 2007, VLDB.

[100]  Benjamin Van Durme,et al.  Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction , 2008, AAAI.

[101]  Patrick Pantel,et al.  Concept Discovery from Text , 2002, COLING.

[102]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[103]  Partha Pratim Talukdar,et al.  Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks , 2008, EMNLP.

[104]  Rayid Ghani,et al.  Semi-Supervised Learning of Attribute-Value Pairs from Product Descriptions , 2007, IJCAI.

[105]  Alon Y. Halevy,et al.  PQL: a declarative query language over dynamic biological schemata , 2002, AMIA.

[106]  William W. Cohen,et al.  Contextual search and name disambiguation in email using graphs , 2006, SIGIR.

[107]  Arik Azran,et al.  The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks , 2007, ICML '07.

[108]  Rajat Raina,et al.  Constructing informative priors using transfer learning , 2006, ICML.

[109]  A. Volgenant,et al.  Reduction tests for the steiner problem in grapsh , 1989, Networks.

[110]  Erhard Rahm,et al.  Similarity flooding: a versatile graph matching algorithm and its application to schema matching , 2002, Proceedings 18th International Conference on Data Engineering.

[111]  Wendy G. Lehnert,et al.  Using Decision Trees for Coreference Resolution , 1995, IJCAI.

[112]  Patrick Pantel,et al.  Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations , 2006, ACL.

[113]  Amanda Spink,et al.  Real life, real users, and real needs: a study and analysis of user queries on the web , 2000, Inf. Process. Manag..

[114]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[115]  Ellen Riloff,et al.  A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts , 2002, EMNLP.

[116]  Soumen Chakrabarti,et al.  Learning to rank networked entities , 2006, KDD '06.

[117]  Luis Gravano,et al.  Text joins in an RDBMS for web data integration , 2003, WWW '03.

[118]  Joseph O'Rourke,et al.  Handbook of Discrete and Computational Geometry, Second Edition , 1997 .

[119]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[120]  Jennifer Widom,et al.  Lineage tracing in data warehouses , 2001 .

[121]  Oren Etzioni,et al.  Towards a theory of natural language interfaces to databases , 2003, IUI '03.

[122]  Shankar Kumar,et al.  Video suggestion and discovery for youtube: taking random walks through the view graph , 2008, WWW.

[123]  Richard T. Wong,et al.  A dual ascent approach for steiner tree problems on a directed graph , 1984, Math. Program..

[124]  Ellen Riloff,et al.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping , 1999, AAAI/IAAI.