A VSM-based data mining engine for geoscience documents

With the development of information technology in the geosciences, the volume of data and documents has grown beyond what ordinary methods can process, and it is difficult to locate a target document quickly and precisely. In this paper, we propose the use of the vector space model (VSM) for automatic data mining of geoscience documents, and we design and implement a VSM-based search engine system. The system has three main components: 1) a word segmentation structure in which two hash tables manage the first and last words of a geo-item and a trie holds the remaining words; 2) a linear space composed of all related documents whose similarity needs to be calculated; 3) a vector space module that maps documents into a multi-dimensional vector space and compares keywords with document features to determine similarity. The system facilitates geodata sharing and improves workflow efficiency.
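
To illustrate the third component, the following is a minimal sketch of VSM-style matching, not the system's actual implementation. It assumes raw term counts as vector weights, cosine similarity as the comparison measure, and whitespace tokenization in place of the hash-table and trie segmenter; the abstract specifies none of these details. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def to_vector(tokens):
    """Map a token list to a sparse term-count vector (assumed weighting)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs):
    """Score every document against the query, most similar first."""
    q = to_vector(query_tokens)
    scored = [(name, cosine_similarity(q, to_vector(tokens)))
              for name, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents, already split into tokens.
docs = {
    "doc_a": "granite intrusion age dating granite".split(),
    "doc_b": "sedimentary basin subsidence model".split(),
}
ranking = rank("granite age".split(), docs)
```

Here `ranking` places `doc_a` first because it shares both query terms, while `doc_b` shares none and scores zero. A production system would more likely weight terms by something like TF-IDF rather than raw counts.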