Entity-based information retrieval

The goal of this master’s thesis is to extract structured semantic information from Swedish text documents and to store it as metadata alongside the original text. Making this metadata searchable should yield a more powerful search engine, both because it allows for more complex queries and because it can be used to return more relevant results. To achieve this, we explored different methods of named entity recognition. We associated each named entity found in a text with a unique identifier in a semantic network; these identifiers can then be used to extract structured information about the entities from the network. The result of this thesis is a program that takes plain text as input and outputs structured information about the entities in the text. We evaluated the performance of the different parts of the program and compared it with that of existing systems.
