Building Structured Databases of Factual Knowledge from Massive Text Corpora

In today's computerized, information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to textual information from many other domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text. In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora, in order to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.
