Paper information - Web relation extraction with distant supervision

Web relation extraction with distant supervision

Finding relevant information about prominent entities quickly is the main reason to use a search engine. However, given the large quantity of information on the World Wide Web, real-time search over billions of Web pages can waste resources and the end user's time. One solution is to store the answers to frequently asked general knowledge queries, such as the albums released by a musical artist, in a more accessible format: a knowledge base. Knowledge bases can be created and maintained automatically using information extraction methods, particularly methods that extract relations between proper names (named entities). Distantly supervised approaches have become popular for this in recent years because they allow relation extractors to be trained without text-bound annotation; instead, known relations from a knowledge base are heuristically aligned with a large textual corpus from an appropriate domain. This thesis researches distant supervision for the Web domain. It introduces a new setting for creating training and testing data for distant supervision from the Web with entity-specific search queries, and publishes the resulting corpus. It researches methods to recognise noisy training examples, as well as methods to combine extractions based on statistics derived from the background knowledge base. It also investigates using co-reference resolution methods to extract relations from sentences that do not contain a direct mention of the subject of the relation. Named entity recognition and classification (NERC) is identified as one bottleneck for distant supervision on Web data, since relation extraction methods rely on it to identify relation arguments. Typically, existing pre-trained tools are used, which fail in diverse genres with non-standard language, such as the Web genre. The thesis explores what can cause NERC methods to fail in diverse genres and quantifies the different reasons for NERC failure. Finally, a novel method for NERC for relation extraction is proposed, based on the idea of jointly training the named entity classifier and the relation extractor with imitation learning to reduce the reliance on external NERC tools. This thesis improves the state of the art in distant supervision for knowledge base population, and sheds light on, and proposes solutions for, issues arising in information extraction for domains that have not traditionally been studied.
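
The heuristic alignment at the core of distant supervision can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy knowledge base `KB`, the sentences in `SENTENCES`, the function `label_sentences`, and the plain substring matching are all assumptions made for illustration. A real system would rely on NERC and entity linking to find relation arguments, which is the bottleneck the abstract describes.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (subject, object) -> relation name.
KB: Dict[Tuple[str, str], str] = {
    ("The Beatles", "Abbey Road"): "album",
    ("The Beatles", "Liverpool"): "origin",
}

# Hypothetical corpus sentences, e.g. retrieved with entity-specific queries.
SENTENCES: List[str] = [
    "The Beatles released Abbey Road in 1969.",
    "The Beatles were formed in Liverpool.",
    "Abbey Road is a street that The Beatles crossed.",
    "Paul McCartney was born in Liverpool.",
]


def label_sentences(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both arguments of a known relation.

    Returns (sentence, subject, object, relation) tuples. Matching is naive
    substring search, standing in for proper entity recognition.
    """
    examples = []
    for sentence in sentences:
        for (subj, obj), relation in kb.items():
            if subj in sentence and obj in sentence:
                examples.append((sentence, subj, obj, relation))
    return examples


examples = label_sentences(SENTENCES, KB)
for sentence, subj, obj, relation in examples:
    print(f"{relation}({subj}, {obj}) <- {sentence}")
```

In this sketch the third sentence is labelled `album` although it refers to the street rather than the record. Automatically labelled examples of this kind are the noisy training examples that the thesis sets out to recognise.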

Isabelle Augenstein
