Leveraging Category-based LSI for Patent Retrieval

Latent Semantic Indexing (LSI) has been employed to reduce the dimensionality of document indices for similarity search. In this paper, we describe a method for retrieving conceptually similar patents by first categorizing the patent collection and then applying LSI separately to each category. The main strategy is to keep the algorithm as simple as possible while achieving scalability for massive datasets. During the categorization phase, we allow any patent to be classified into multiple categories, so the same patent document may appear in several categories. Then, for each category, we apply LSI to project its documents into a much lower-dimensional space. Finally, given a query consisting of the claim sentences of a patent, we select the most similar category and return the top fifty ranked patent documents as candidates for invalidating the query document.
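
The pipeline described above (overlapping categories, one LSI model per category, category selection, ranking) can be illustrated with a minimal sketch. It is not the implementation used in this paper. The whitespace tokenizer, the raw term counts, the rule of scoring a category by its best-matching document, the toy patents, and all function names are assumptions made for illustration; the truncated SVD uses NumPy's `np.linalg.svd`.

```python
import numpy as np

def term_doc_matrix(docs, vocab):
    """Raw term counts, one column per document (assumed weighting)."""
    A = np.zeros((len(vocab), len(docs)))
    for j, text in enumerate(docs):
        for word in text.lower().split():
            if word in vocab:
                A[vocab[word], j] += 1.0
    return A

def fit_lsi(docs, k):
    """Fit one LSI model on the documents of a single category."""
    words = sorted({w for text in docs for w in text.lower().split()})
    vocab = {w: i for i, w in enumerate(words)}
    A = term_doc_matrix(docs, vocab)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Document coordinates in the latent space: rows of V_k scaled by s_k,
    # which equals U_k^T A, so queries projected with U_k^T are comparable.
    doc_vecs = Vt[:k, :].T * s[:k]
    return {"vocab": vocab, "U": U[:, :k], "doc_vecs": doc_vecs}

def project(model, text):
    """Fold a query into a category's latent space via U_k^T q."""
    q = term_doc_matrix([text], model["vocab"])[:, 0]
    return model["U"].T @ q

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def retrieve(models, categories, query, top_n=50):
    """Pick the category with the best-matching document (assumed rule),
    then rank that category's patents by cosine similarity."""
    best_cat, best_scores = None, None
    for cat, model in models.items():
        q = project(model, query)
        scores = [cosine(q, d) for d in model["doc_vecs"]]
        if best_scores is None or max(scores) > max(best_scores):
            best_cat, best_scores = cat, scores
    ranked = sorted(zip(categories[best_cat], best_scores),
                    key=lambda pair: -pair[1])
    return best_cat, ranked[:top_n]

# Toy data (hypothetical). P2 belongs to both categories, mirroring the
# overlap allowed during categorization.
patents = {
    "P1": "battery electrode lithium cell",
    "P2": "lithium battery charging circuit",
    "P3": "circuit amplifier signal transistor",
    "P4": "transistor signal filter circuit",
}
categories = {"energy": ["P1", "P2"], "electronics": ["P2", "P3", "P4"]}
models = {cat: fit_lsi([patents[i] for i in ids], k=2)
          for cat, ids in categories.items()}

category, hits = retrieve(models, categories,
                          "lithium cell with a battery electrode", top_n=2)
print(category, hits)
```

Because each SVD is computed on a single category's term-document matrix rather than on the whole collection, the cost of the decomposition depends on the category size, which is the source of the scalability argued for above. In this sketch, query words that are absent from a category's vocabulary are simply ignored.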