Mining intelligence of crowds for knowledge inference

Consider a query to a search engine, "apple orange tie up europe," or a sentence in a document, "SDI gained the popular name Star Wars after the 1977 film by George Lucas," buried in the data sets of an information processing system. Given such a succinct representation, how does the system infer that "apple" and "orange" in the query refer to the corporations Apple and Orange, or that the indexed document could be relevant to the query "Strategic Defense Initiative"? If the underlying data happens to be video or pictures, can a computer actually understand the contents of an image well enough to deliver relevant results for a user's keyword queries? The algorithms we present help large information processing systems do just that.

When humans process text, they interpret and express information in the context of their background knowledge and experience. Information retrieval systems, on the other hand, are restricted to learning statistical models from low-level features, such as individual words or pixel regions, drawn from a limited training set. In this thesis, we propose to enrich this low-level representation with world knowledge. We achieve this using a vast digital repository of human knowledge, the encyclopedia Wikipedia, whose articles serve as explicit knowledge concepts. We demonstrate that inference based on the world knowledge in Wikipedia repositories can be applied at Web scale.

To this end, we apply the proposed solutions to the task of query expansion, improving the relevance of search results for difficult queries. Based on the work in this thesis, we built a state-of-the-art system called "Do You Mean?", which generates a ranked forest of concepts for a given ambiguous word in near real time. Finally, we explore the possibility of using knowledge repositories to express image information in the form of knowledge concepts. We build an affinity matrix over 22,000 concepts and 135,000 images, lay down a distributed infrastructure for handling terabytes of data, and report initial results for the proposed algorithm.
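To make the concept representation concrete, the following Python sketch maps raw text to a weighted vector of Wikipedia concepts, treating each article title as an explicit concept. The inverted index word_to_concepts, the article titles, and the weights are hypothetical stand-ins for an index that would be precomputed offline from a Wikipedia dump; the plain weight sum is only a caricature of the inference developed in the thesis.

    from collections import defaultdict

    # Hypothetical word -> {Wikipedia article: relevance weight} index.
    # A real index would be built offline from a full Wikipedia dump,
    # e.g. with TF-IDF weights of words in article text.
    word_to_concepts = {
        "apple":  {"Apple Inc.": 0.9, "Apple (fruit)": 0.7},
        "orange": {"Orange S.A.": 0.8, "Orange (fruit)": 0.7},
        "europe": {"Europe": 0.9, "European Union": 0.6},
    }

    def concept_vector(text):
        """Sum, per Wikipedia concept, the weights of all matching words,
        yielding an explicit concept-space representation of the text."""
        scores = defaultdict(float)
        for word in text.lower().split():
            for concept, weight in word_to_concepts.get(word, {}).items():
                scores[concept] += weight
        return dict(scores)

    print(concept_vector("apple orange tie up europe"))

With a full index, corporate concepts such as "Apple Inc." and "Orange S.A." reinforce each other across the query words, which is what allows a system to prefer the corporate reading of the query over the fruit reading.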
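A concept vector of this kind suggests a direct route to query expansion: append the titles of the top-ranked concepts to the original query. The sketch below, reusing concept_vector from the previous sketch, shows one plausible strategy; the expansion actually used in the thesis may select and weight terms differently.

    def expand_query(query, k=3):
        """Expand a query with the titles of its top-k concepts."""
        ranked = sorted(concept_vector(query).items(), key=lambda kv: -kv[1])
        return query + " " + " ".join(title for title, _ in ranked[:k])

    print(expand_query("apple orange tie up europe"))
    # e.g. 'apple orange tie up europe Apple Inc. Europe Orange S.A.'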
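Finally, one way to picture the concept-image affinity matrix: if each image carries textual metadata (captions, surrounding text, tags), that metadata can be mapped into concept space and the weights accumulated into a concepts-by-images matrix. The images, captions, and dense NumPy representation below are illustrative assumptions only; at the scale of 22,000 concepts and 135,000 images the matrix is sparse and is built on the distributed infrastructure described in the thesis.

    import numpy as np

    concepts = ["Apple Inc.", "Apple (fruit)", "Orange S.A.",
                "Orange (fruit)", "Europe", "European Union"]
    concept_index = {c: i for i, c in enumerate(concepts)}

    # Hypothetical images with textual metadata.
    images = {
        "img_001.jpg": "apple orchard in europe",
        "img_002.jpg": "orange juice and an apple",
    }

    affinity = np.zeros((len(concepts), len(images)))
    for j, caption in enumerate(images.values()):
        for concept, weight in concept_vector(caption).items():
            if concept in concept_index:
                affinity[concept_index[concept], j] = weight

    # Rank images by their affinity to a chosen concept.
    row = affinity[concept_index["Europe"]]
    names = list(images)
    print([names[j] for j in np.argsort(-row) if row[j] > 0])

Once filled in, such a matrix can be read in both directions: rows retrieve images relevant to a keyword concept, and columns express an image as a vector of knowledge concepts.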