Efficient Discovery of New Information in Large Text Databases

Intelligence analysts often face large data collections in which information relevant to their interests may be very sparse. Existing mechanisms for searching such collections present difficulties even when the specific nature of the information sought is known; when it is not known in advance, they are very inefficient. This paper presents an approach to this problem based on iterative application of latent semantic indexing (LSI). In this approach, the body of existing knowledge on the analytic topic of interest is itself used as a query to discover new relevant information. Performance of the approach is demonstrated on a collection of one million documents, where it is shown to be highly efficient at discovering new information.
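The idea of using existing knowledge as a query and iterating can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper, and it rests on assumptions chosen for illustration: a tiny corpus, raw term counts with no weighting, a truncated SVD computed with NumPy, a query formed as the centroid of the known documents in the latent space, and a feedback step that simply adds the top-ranked document to the known set (in practice an analyst would judge relevance). The names `build_term_doc_matrix`, `lsi_doc_vectors`, `iterative_discovery`, and `EXAMPLE_DOCS` are hypothetical.

```python
import numpy as np

# Hypothetical toy corpus: documents 0-2 share one topic, 3-5 another.
EXAMPLE_DOCS = [
    "missile launch test site",
    "missile test range launch",
    "launch site missile fuel",
    "wheat harvest grain price",
    "grain price wheat market",
    "harvest market wheat grain",
]

def build_term_doc_matrix(docs):
    """Return a terms x documents matrix of raw term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A

def lsi_doc_vectors(A, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Row j holds the latent coordinates of document j.
    return (Vt[:k, :] * s[:k, None]).T

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def iterative_discovery(docs, known, k=2, rounds=2):
    """Repeatedly query with the known set and absorb the best new document.

    Assumption: the top-ranked document is accepted automatically; a real
    workflow would put an analyst's relevance judgment at this step.
    """
    D = lsi_doc_vectors(build_term_doc_matrix(docs), k)
    known = list(known)
    discovered = []
    for _ in range(rounds):
        candidates = [j for j in range(len(docs)) if j not in known]
        if not candidates:
            break
        # The body of existing knowledge acts as the query (centroid here).
        query = D[known].mean(axis=0)
        best = max(candidates, key=lambda j: cosine(query, D[j]))
        discovered.append(best)
        known.append(best)  # the query grows with each iteration
    return discovered

if __name__ == "__main__":
    print(iterative_discovery(EXAMPLE_DOCS, known=[0], k=2, rounds=2))
```

Seeding the known set with document 0 and running two rounds surfaces the other two documents on the same topic ahead of the unrelated ones. At the scale discussed here, a dense SVD would not be practical; a sparse or incremental factorization would be needed instead.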