Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources

Large collections of textual documents are an example of big data that requires solving three basic problems: representing the documents, representing the information needs, and matching the two representations. This paper introduces document indexing as a possible solution to the document representation problem. The documents come from a large textual database that has been built up over many years for geological projects in the Republic of Serbia. They were indexed using two methods developed within digital humanities: bag-of-words and named entity recognition. Each document in this geological database is described by a summary report and other data, such as title, domain, keywords, abstract, and geographical location. These metadata were used to generate a bag of words for each document with the aid of morphological dictionaries and transducers. Named entities within the metadata were also recognized with the help of a rule-based system. Both the bag of words and the metadata were then used to pre-index each document. A combination of several tf-idf based measures was applied to select and rank the indexed documents retrieved for a specific query, and the results were compared with those of the initial retrieval system that was already in place. In general, a significant improvement was achieved according to the standard information retrieval performance measures, with the InQuery method performing best.
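The ranking step can be illustrated with a minimal sketch of tf-idf scoring over per-document bags of words. This is not the paper's implementation: the function names `build_index` and `rank`, the toy documents, and the particular weighting (length-normalized term frequency multiplied by log inverse document frequency, summed over query terms) are assumptions chosen for illustration, and the input tokens are assumed to be already lemmatized.

```python
import math
from collections import Counter


def build_index(docs):
    """Build a bag of words per document and document frequencies per term.

    docs: dict mapping a document id to a list of (already lemmatized) tokens.
    """
    bags = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())  # each document counts once per distinct term
    return bags, df


def rank(query_terms, bags, df):
    """Return ids of matching documents, ordered by summed tf-idf score."""
    n_docs = len(bags)
    scores = {}
    for doc_id, bag in bags.items():
        length = sum(bag.values())
        score = 0.0
        for term in query_terms:
            if term in bag:
                tf = bag[term] / length            # length-normalized term frequency
                idf = math.log(n_docs / df[term])  # rarer terms weigh more
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection standing in for document metadata.
docs = {
    "d1": ["copper", "ore", "deposit", "bor"],
    "d2": ["groundwater", "survey", "belgrade"],
    "d3": ["copper", "copper", "exploration", "report"],
}
bags, df = build_index(docs)
ranking = rank(["copper", "ore"], bags, df)
print(ranking)
```

In this toy collection the query matches `d1` on both terms and `d3` on one, while `d2` has no matching term and is left out of the ranking. Other tf-idf variants, including the InQuery weighting mentioned in the abstract, would replace only the two lines that compute `tf` and `idf`.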
