High-performance SIMD text processing using the method of parallel bit streams is introduced through a case study of UTF-8 to UTF-16 transcoding. A forward transform converts byte-oriented character stream data into eight parallel bit streams, one for each bit position within a byte. Decoding, validation, and computation of UTF-8-indexed UTF-16 bit streams are performed using bit-parallel logic and shift operations. Conversion from UTF-8 indexing to UTF-16 indexing is performed using parallel bit deletion. The inverse transform is then applied to yield high and low UTF-16 byte streams, which are merged to form the output. Combined with optimization techniques for blocks of ASCII data, speed-ups of 3 to 25 times are achieved on commodity processors compared with optimized byte-at-a-time code. Further applications of the method of parallel bit streams to bulk text processing are briefly discussed, along with future prospects for combining intraregister and intrachip parallelism on multicore processors.
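The forward transform and the bit-parallel classification step can be illustrated with a minimal scalar sketch. It is not the SIMD implementation described above: it assumes arbitrary-precision Python integers as stand-ins for bit streams (bit i of each integer corresponds to byte position i), and the function names `transpose`, `inverse_transpose`, and `classify` are hypothetical, chosen only for this illustration. Validation and parallel bit deletion are omitted.

```python
def transpose(data: bytes) -> list[int]:
    """Forward transform: split a byte stream into eight parallel bit streams.

    streams[k] collects bit k of every byte (k = 0 is the most significant bit).
    Bit i of each stream integer corresponds to byte position i in the input.
    """
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):
                streams[k] |= 1 << pos
    return streams


def inverse_transpose(streams: list[int], length: int) -> bytes:
    """Inverse transform: rebuild the byte stream from eight bit streams."""
    out = bytearray(length)
    for pos in range(length):
        for k in range(8):
            if (streams[k] >> pos) & 1:
                out[pos] |= 0x80 >> k
    return bytes(out)


def classify(streams: list[int], length: int) -> dict[str, int]:
    """Classify all byte positions at once using only bitwise logic.

    Each result is a bit stream marking the positions of one UTF-8 byte class.
    """
    mask = (1 << length) - 1  # keep only bits for real positions
    b0, b1, b2, b3, b4 = streams[:5]
    return {
        "ascii": ~b0 & mask,                      # 0xxxxxxx
        "continuation": b0 & ~b1 & mask,          # 10xxxxxx
        "lead2": b0 & b1 & ~b2 & mask,            # 110xxxxx
        "lead3": b0 & b1 & b2 & ~b3 & mask,       # 1110xxxx
        "lead4": b0 & b1 & b2 & b3 & ~b4 & mask,  # 11110xxx
    }


if __name__ == "__main__":
    text = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3-, 4-byte sequences
    streams = transpose(text)
    for name, stream in classify(streams, len(text)).items():
        # Print each class stream with position 0 on the left.
        print(f"{name:>12}: {format(stream, f'0{len(text)}b')[::-1]}")
```

Each bitwise expression in `classify` handles every byte position in a single operation. In a SIMD setting the same logic would be applied to register-width blocks of the bit streams rather than to one unbounded integer.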