Automatic Entity Recognition and Typing in Massive Text Corpora

In today's computerized, information-based society, we are inundated with vast amounts of natural language text, ranging from news articles, product reviews, and advertisements to a wide variety of user-generated content on social media. To turn such massive unstructured text into actionable knowledge, one of the grand challenges is to understand the entities it mentions and the relationships between them. In this tutorial, we introduce data-driven methods for recognizing typed entities of interest in different kinds of text corpora, especially massive, domain-specific ones. These methods automatically identify token spans as entity mentions and label their types (e.g., person, product, food) in a scalable way. Using real datasets, including news articles and Yelp reviews, we demonstrate how these typed entities aid knowledge discovery and management.
