A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression

Summary form only given. The suffix array is a memory-efficient data structure for searching for any substring of a text. It is also used to define the Burrows-Wheeler transformation (BWT), which is the core of block sorting compression. When a compressed text is decoded, the inverse BWT, which is faster than the forward transformation, is performed, and the suffix array of the text is obtained in the process. This means that a text and its suffix array can be compressed and transferred together simply by using block sorting, a fact that can be used for creating large full-text databases. We propose a modified Burrows-Wheeler transformation. With our transformation, decoding a compressed text yields a suffix array that can be used for case-insensitive searches. An exact query can still be answered from the result of a case-insensitive search, because the original text can be decoded from the compressed text. The method applies not only to case folding but also to more general character conversions. We call the conversion unification and the text after conversion the unified text. The proposed transformation is defined by the suffix array of the unified text. It is not a permutation of the alphabet followed by the original transformation, but a combination of unification and the original transformation. From a text compressed with our transformation we can obtain both the original text and the suffix array of the unified text. After decoding, we can perform ambiguous searches, such as case-insensitive search, by using the suffix array. Experimental results show that our transformation degrades the compression ratio only slightly. Although decompression followed by search takes more time than decoding with the original block sorting followed by the grep command, finding the positions of keywords is quite fast, and these positions can then be used for advanced searches.
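
The idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes case folding as the unification, a sentinel character "\0" that does not occur in the text, and a naive suffix sort, and the names unify, encode, decode and find_all are hypothetical. Suffixes are sorted on the unified text, but the emitted last column holds the original characters. Unifying that column gives the ordinary BWT of the unified text, so a standard inverse recovers the suffix array, while the original characters are read back from the same rows.

```python
SENTINEL = "\0"  # assumption: absent from the text; sorts before every other character


def unify(ch):
    """Case folding as the unification; keeps ch if lowering would change its length."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def unify_text(s):
    return "".join(unify(c) for c in s)


def encode(text):
    """Return (last_column, suffix_array), with suffixes sorted on the unified text."""
    t = text + SENTINEL
    u = unify_text(t)
    sa = sorted(range(len(u)), key=lambda i: u[i:])  # naive suffix array construction
    last = "".join(t[i - 1] for i in sa)  # original characters, not unified ones
    return last, sa


def decode(last):
    """Recover the original text and the suffix array of the unified text."""
    n = len(last)
    ulast = unify_text(last)  # this is the ordinary BWT of the unified text
    order = sorted(range(n), key=lambda i: ulast[i])  # stable sort gives the first column
    lf = [0] * n
    for k, i in enumerate(order):
        lf[i] = k  # row i maps to the row of the suffix starting one position earlier
    sa = [0] * n
    chars = []
    row = 0  # row 0 is the suffix consisting of the sentinel alone
    for pos in range(n - 1, -1, -1):
        sa[row] = pos
        chars.append(last[row])  # original character preceding position pos
        row = lf[row]
    # chars holds the text backwards, followed by the wrapped-around sentinel
    return "".join(reversed(chars[:-1])), sa


def find_all(text, sa, pattern):
    """Case-insensitive search: binary search on the suffix array of the unified text."""
    u = unify_text(text) + SENTINEL
    p = unify_text(pattern)
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose length-m prefix is >= p
        mid = (lo + hi) // 2
        if u[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first row whose length-m prefix is > p
        mid = (lo + hi) // 2
        if u[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "Banana bandana BAN"
last, _ = encode(text)
decoded, sa = decode(last)
hits = find_all(decoded, sa, "ban")                # [0, 7, 15]
exact = [i for i in hits if decoded.startswith("ban", i)]  # [7]
```

The last two lines show how an exact query can be answered from a case-insensitive result: the candidate positions are checked against the decoded original text. A practical system would replace the naive sort with a proper suffix array construction and would entropy-code the last column, both of which are omitted here.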