Existing Chinese full-text retrieval algorithms face several problems with large data: their data structures are difficult to expand, they are not suited to incremental indexing, and their retrieval efficiency is low. To address these problems, this paper builds on the traditional inverted index structure and proposes the block linked-list index structure, which supports extensible large-data storage and real-time updates. First, the new algorithm introduces the block unit as a management concept: each block unit manages a document set, an index is created for each term in the master index, and an index linked list maps each term index to the block units. This block linked-list structure greatly improves the index's capacity for expansion. Second, the master index and the document index both use fixed-length storage with the same record length, so the position of each term's index information is fixed in both the master index file and the document index file of each block unit. This addresses the problem of incremental index updates and improves update efficiency. Finally, in the experiments, 350,000 documents (about 1.46 TB of data) are randomly selected from the internet corpus SogouT and used to compare the two index algorithms in three respects: index creation on the initial dataset, updates across a large number of files, and retrieval over massive datasets. The results show that the new index algorithm achieves higher processing performance, especially in update efficiency, where the improvement is nearly 10%.
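The structure described above can be pictured with a minimal in-memory sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `BlockUnit`, `BlockLinkedIndex`, and `SLOT`, the 8-byte slot layout, the integer term ids, and the use of a Python list in place of an on-disk linked list are all hypothetical. It shows the two ideas in the abstract: documents are grouped into block units that each term chains to, and every term owns a fixed-length slot whose offset never moves, so an incremental update rewrites one slot in place.

```python
import struct

# Hypothetical fixed-length master-index slot: (document frequency, number of
# chained blocks). Every term occupies exactly SLOT.size bytes.
SLOT = struct.Struct("<II")


class BlockUnit:
    """Manages a bounded document set and the postings of those documents."""

    def __init__(self, block_id, capacity):
        self.block_id = block_id
        self.capacity = capacity  # assumed limit on documents per block
        self.doc_ids = set()
        self.postings = {}        # term_id -> doc ids stored in this block

    def is_full(self):
        return len(self.doc_ids) >= self.capacity


class BlockLinkedIndex:
    """Master index of fixed-length slots plus a per-term chain of block units."""

    def __init__(self, vocab_size, block_capacity):
        self.master = bytearray(SLOT.size * vocab_size)
        self.chains = {}          # term_id -> block ids (stand-in for a linked list)
        self.blocks = []
        self.block_capacity = block_capacity

    def add_document(self, doc_id, term_ids):
        # Expansion: open a new block unit instead of reorganising old ones.
        if not self.blocks or self.blocks[-1].is_full():
            self.blocks.append(BlockUnit(len(self.blocks), self.block_capacity))
        block = self.blocks[-1]
        block.doc_ids.add(doc_id)
        for term_id in set(term_ids):
            block.postings.setdefault(term_id, []).append(doc_id)
            chain = self.chains.setdefault(term_id, [])
            if not chain or chain[-1] != block.block_id:
                chain.append(block.block_id)
            # Incremental update: the slot offset depends only on the term id,
            # so the record is rewritten in place and no other slot shifts.
            offset = term_id * SLOT.size
            doc_freq, _ = SLOT.unpack_from(self.master, offset)
            SLOT.pack_into(self.master, offset, doc_freq + 1, len(chain))

    def doc_frequency(self, term_id):
        return SLOT.unpack_from(self.master, term_id * SLOT.size)[0]

    def search(self, term_id):
        # Follow the term's chain and collect postings block by block.
        result = []
        for block_id in self.chains.get(term_id, []):
            result.extend(self.blocks[block_id].postings.get(term_id, []))
        return result


index = BlockLinkedIndex(vocab_size=8, block_capacity=2)
index.add_document(1, [0, 3])
index.add_document(2, [3])
index.add_document(3, [3, 5])  # block 0 is full, so this opens block 1
print(index.search(3), index.chains[3], index.doc_frequency(3))
```

In this sketch the third document triggers a new block unit, and the chain for term 3 grows by one entry while the master index keeps its size, which is the property the fixed-length layout is meant to provide.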