On applying formal grammar and languages, and deduction to information retrieval modelling

The paper applies formal methods (deduction, grammar) to some aspects of information retrieval. A formal definition of information retrieval is given as establishing a measure of a relation between documents and a user model. The user model consists of the query and additional information on the user, which is partly stored and partly deduced from the stored data and a general rule base. The retrieval system should then answer the user model rather than the query. Further, it is shown that the set of documents represented in a normal form is recursive, which makes it possible to design an additional validation processor to check the format of new documents being uploaded. It is also shown that the formal correctness of a query does not necessarily imply its positive answerability.

1 User model and deduction

1.1 Information need

Information Retrieval (IR) is concerned with the organisation, storage, retrieval, and evaluation of information relevant to a user’s information need. The main components of IR are as follows: user; information need; request; query; information stored in computer(s); appropriate computer programs.

The user has an information need (i.e., wants to find out something, is looking for information on something; e.g., articles published on a certain subject, books written by an author, banks offering online banking services, travel agencies with last minute offers, etc.). The information need is formulated as a request for information, in natural language. The request is then expressed as a query, in the form required by the computer programs (e.g., according to the syntax of a query language). These programs retrieve information in response to a query, e.g., they return database records, journal articles, WWW (World Wide Web) pages, etc. This is why, mainly in practice, IR can also be viewed as a system, and the term Information Retrieval System (IRS) is also used.
If a user, say U, is interested in journal articles and/or authors on, e.g., ‘mathematical methods and techniques used in information retrieval’, then this is the user’s information need; let us denote it by IN. The information need IN is reformulated in a form accepted by the search processor (engine); it thus becomes a query, say Q.

1.2 Information retrieval without hidden information

Information is stored in computer databases. More generally, information is stored in entities which may be generically referred to as objects O, e.g., abstracts, articles, images, sounds, etc.; these are traditionally called documents. The objects should be represented in such a way that they can be subjected to appropriate algorithms and computer programs. The same holds for queries.

The overall aim of an IR system is to return, or try to return, information which is relevant to the user, i.e., information that is useful and meaningful. Thus IR may be reformulated symbolically (or formally) as a 4-tuple yielding retrieved objects as follows:

    IR = (U, IN, Q, O) → R

1.3 Implicit (hidden) information

The information need IN is more than its expression as a query Q: IN comprises the query Q plus additional information about the user U. This additional information is specific to the user: spoken languages, fields of interest, preferred journals, specialisation, profession, most frequently used queries, etc. It matters because it is one factor in judging whether a retrieved object is relevant or not. For example, the same search term PROGRAM has different meanings for a computer programmer (meaning a text written in the C programming language and solving a differential equation) and for a conference organiser (meaning a structure and sequence of scientific and social events during the conference). The additional information is obvious to the user (he/she implicitly assumes it) but not to the computer.
We may therefore term this additional information implicit information I specific to the user U, and we may write:

    IN = (Q, I)

Thus the concept of IR can be reformulated as being concerned with finding an appropriate (relevance) relationship, say R, between objects O and the information need IN; symbolically:

    IR = R(O, IN) = R(O, (Q, I))

1.4 Information retrieval with hidden information

In order for an IR system to find such a relation R, it should be able to take into account the implicit information I as well, and ideally also the information which can be deduced (inferred) from I, so as to obtain as complete a picture of the user U as possible. Finding an appropriate relation R would thus mean obtaining (deriving, inferring) those objects O which match the meaning of the query Q and satisfy the implicit information I. With this, IR becomes:

    IR = R(O, (Q, 〈I, |→〉))

where 〈I, |→〉 means I plus the information derivable (e.g., in some language or logic), inferred, or deduced from I. Of course, the relation R is established with some (un)certainty m; thus:

    IR = m[R(O, (Q, 〈I, |→〉))]

There is a rich literature on user modelling. Based on [4], [5], we give a small example to illustrate a possible meaning of 〈I, |→〉. The user's implicit information I may be stored permanently (and updated as necessary). Consider, for example, the following user:

    Identifier: U100
    Name: UserOneHundred
    Languages spoken: Hungarian
    Age: 24
    Computer skills: payroll software
    Profession: Secretary

A rule base to deduce additional information from I may be:

    IF (user has retrieval experience) THEN (user is skilled AND likes shortcuts AND is familiar with Boolean expressions)
    IF (user has no OR little retrieval experience) THEN (user is a beginner AND prefers menus)
    IF (user is a child) THEN (user likes more colours and little text)
    IF (user does not speak English) THEN (do not return hits in the English language)
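
The deduction of additional information from I by such a rule base can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not an implementation taken from the paper: the names UserProfile, RULES, deduce and CHILD_AGE_LIMIT are hypothetical, the field retrieval_experience does not appear in the example profile and is assumed to be false for user U100, 'no OR little retrieval experience' is simplified to a Boolean flag, and the age below which a user counts as a child is an arbitrary choice. Since no rule conclusion feeds the premise of another rule, a single pass over the rules is enough here.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple

# Assumption: the text does not say at what age a user counts as a child.
CHILD_AGE_LIMIT = 14


@dataclass
class UserProfile:
    """Stored implicit information I about a user (hypothetical structure)."""
    identifier: str
    name: str
    languages: Set[str]
    age: int
    computer_skills: Set[str]
    profession: str
    # Assumption: not part of the example profile; treated as False by default.
    retrieval_experience: bool = False


# A rule pairs a premise (a test on the profile) with the facts it yields.
Rule = Tuple[Callable[[UserProfile], bool], FrozenSet[str]]

RULES: List[Rule] = [
    (lambda u: u.retrieval_experience,
     frozenset({"skilled", "likes shortcuts",
                "familiar with Boolean expressions"})),
    (lambda u: not u.retrieval_experience,
     frozenset({"beginner", "prefers menus"})),
    (lambda u: u.age < CHILD_AGE_LIMIT,
     frozenset({"likes more colours and little text"})),
    (lambda u: "English" not in u.languages,
     frozenset({"no English-language hits"})),
]


def deduce(user: UserProfile) -> Set[str]:
    """Return the facts derivable from the stored profile in one pass."""
    derived: Set[str] = set()
    for premise, conclusions in RULES:
        if premise(user):
            derived |= conclusions
    return derived


# The example user from the text.
u100 = UserProfile(
    identifier="U100",
    name="UserOneHundred",
    languages={"Hungarian"},
    age=24,
    computer_skills={"payroll software"},
    profession="Secretary",
)
derived_u100 = deduce(u100)
```

Under these assumptions, the stored profile together with derived_u100 plays the role of 〈I, |→〉 for user U100: the system would treat this user as a beginner who prefers menus, and would not return hits in English.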