Extending the Boolean and vector space models of information retrieval with p-norm queries and multiple concept types

Information retrieval systems accept user queries and respond by identifying documents presumed to be relevant to those queries. Boolean logic is commonly employed as the query language, but an alternative scheme based on the vector space model has also been investigated. Automatic indexing methods allow questions and documents to be represented by weighted vectors, and searching yields a list of documents ranked in decreasing order of query-document similarity. Both the Boolean logic and vector space approaches are special cases of the general p-norm model developed by Wu. In this thesis, analytical results help explain the ranking behavior of p-norm queries. Four experimental test collections are employed to show that interpreting Boolean queries with p-norm techniques leads to substantial improvements in retrieval effectiveness. Several procedures are described for automatically constructing Boolean queries from short lists of words. An algorithm based on probability theory that uses relevance feedback information to produce even more effective Boolean or p-norm queries is also specified and validated by experimental tests; one of its essential innovations is to control query construction by allowing the searcher to specify the desired number of retrieved documents. The vector space model is extended to allow separate handling of types of concepts beyond those derived from textual words or assigned descriptors. Citation, cocitation, and bibliographic coupling data, as well as factual information normally handled by database management systems, are easily incorporated. Clustered search and feedback processes are improved by utilizing combinations of these concept types, as shown by preliminary experimental studies with collections in computer and information science. P-norm queries and multiple concept types are both handled by an updated SMART system implemented using relational database and statistical processing packages. This integrated system should facilitate further research and be adaptable to retrieval applications for office and bibliographic information.
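
To illustrate how the p-norm model can span both the Boolean and vector space approaches, the following minimal Python sketch scores a document against a single OR or AND clause using the commonly published form of the p-norm similarity formulas. The function names, the weight conventions (document term weights in [0, 1], nonnegative query term weights), and the limit handling for p = infinity are assumptions made for this example; they are not taken from the thesis or from the SMART implementation.

```python
import math


def pnorm_or(doc, query, p):
    """p-norm similarity of a document to an OR clause.

    doc   -- document weights for the clause's terms, each in [0, 1]
    query -- query weights for the same terms, nonnegative, not all zero
    p     -- norm parameter, 1 <= p <= infinity
    """
    pairs = list(zip(query, doc))
    if math.isinf(p):
        # Limit as p grows: the largest weighted term dominates.
        return max(a * d for a, d in pairs) / max(a for a, _ in pairs)
    num = sum((a * d) ** p for a, d in pairs)
    den = sum(a ** p for a, _ in pairs)
    return (num / den) ** (1.0 / p)


def pnorm_and(doc, query, p):
    """p-norm similarity of a document to an AND clause (same arguments)."""
    pairs = list(zip(query, doc))
    if math.isinf(p):
        # Limit as p grows: the worst-matching weighted term dominates.
        return 1.0 - max(a * (1.0 - d) for a, d in pairs) / max(a for a, _ in pairs)
    num = sum((a * (1.0 - d)) ** p for a, d in pairs)
    den = sum(a ** p for a, _ in pairs)
    return 1.0 - (num / den) ** (1.0 / p)


if __name__ == "__main__":
    doc = [1.0, 0.0]     # document contains the first term only
    query = [1.0, 1.0]   # two equally weighted query terms
    for p in (1, 2, math.inf):
        print(p, pnorm_or(doc, query, p), pnorm_and(doc, query, p))
```

For a document that matches one of two equally weighted terms, p = 1 gives 0.5 for both operators, so AND and OR are indistinguishable and the score behaves like a normalized vector match. At p = infinity, OR gives 1 and AND gives 0, which is the strict Boolean result. Intermediate values of p fall between these extremes.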