U-AIDA: a customizable system for named entity recognition, classification, and disambiguation

Recognizing and disambiguating entities such as people, organizations, events, or places in natural language text are essential steps for many linguistic tasks such as information extraction and text categorization. A variety of named entity disambiguation methods have been proposed, but most of them focus on Wikipedia as the sole knowledge resource. This focus does not fit all application scenarios, and customization to the respective application domain is crucial. This dissertation addresses the problem of building an easily customizable system for named entity disambiguation. The first contribution is a universal and flexible architecture that supports plugging in different knowledge resources. The second contribution is the use of this architecture to develop two domain-specific disambiguation systems. The third contribution is a complete pipeline for building disambiguation systems for languages other than English that have scarce annotated resources, such as Arabic. The fourth contribution is a novel approach to fine-grained type classification of names in natural language text.
