Chinese-keyword fuzzy search and extraction over encrypted patent documents

Cloud storage for information sharing is likely to become indispensable to China's future national defence library, for example for searching national defence patent documents, and the associated security risks must be minimised through data encryption. Patent keywords give a high-level summary of a patent document, so extracting and searching these keywords efficiently is of practical importance. Because of the particular characteristics of Chinese keywords, most existing algorithms designed for English text are ineffective in Chinese scenarios. Manual keyword extraction is impractical when the number of files is large, so an improved method based on term frequency-inverse document frequency (TF-IDF) is proposed to extract keywords from patent literature automatically. The extracted keyword sets also accelerate keyword search by linking a finite set of keywords to a large number of documents. Fuzzy keyword search is introduced to further increase search efficiency in cloud computing scenarios compared with exact keyword search methods. Based on Chinese Pinyin similarity, a Pinyin-Gram-based algorithm is proposed for fuzzy search over encrypted Chinese text, and a keyword trapdoor search index structure based on an n-ary tree is designed. Both the search efficiency and the accuracy of the proposed scheme are verified through computer experiments.
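
For orientation, the following is a minimal sketch of plain TF-IDF keyword ranking, the baseline that the proposed method builds on. It is not the improved variant described in the paper, whose details are not given here. The function name `tfidf_keywords`, the `top_k` parameter, and the toy documents are illustrative assumptions, and the input is assumed to be already segmented into Chinese words.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Rank terms in each document by classic TF-IDF.

    docs: list of token lists (assumed to be pre-segmented Chinese words).
    Returns a list with the top_k highest-scoring terms of each document.
    This is the textbook formula, not the paper's improved method.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    results = []
    for tokens in docs:
        if not tokens:
            results.append([])
            continue
        tf = Counter(tokens)
        total = len(tokens)
        # score = (relative term frequency) * log(N / document frequency)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical toy corpus of three segmented documents.
docs = [
    ["加密", "检索", "专利", "加密"],
    ["专利", "关键词", "提取", "关键词"],
    ["专利", "云", "存储", "云"],
]
keywords = tfidf_keywords(docs, top_k=1)
print(keywords)
```

In this toy corpus the term 专利 occurs in every document, so its inverse document frequency is zero and it is never selected, which illustrates why TF-IDF favours terms that distinguish one document from the rest.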
