Compact representations of character-sets

Text-processing libraries, such as those for string and pattern matching, need a way to represent sets of characters, such as the set of lower-case Latin letters or the set of numerals. A compact and efficient representation of character sets is especially important with the adoption of Unicode, whose domain is very large (over a million code points). This paper studies design criteria for such representations, reviews existing implementations, describes new representations, and provides an experimental comparison of the representations on real and synthetic data. The new representations combine the strengths of bitmaps and inversion lists while avoiding the worst-case behavior of both.
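
To make the two baseline structures named in the abstract concrete, here is a minimal Python sketch of a bitmap and an inversion list over the Unicode code space. It is only an illustration of the general techniques, not the paper's implementation or its new representations; the function names (`inversion_list`, `in_inversion_list`, `bitmap`, `in_bitmap`) and the input convention of sorted, disjoint, inclusive ranges are assumptions made for this example.

```python
from bisect import bisect_right

UNICODE_LIMIT = 0x110000  # Unicode code points run from 0 to 0x10FFFF


def inversion_list(ranges):
    """Build an inversion list from sorted, disjoint, inclusive (lo, hi) ranges.

    The result is a sorted list of code points at which membership toggles:
    [start1, end1 + 1, start2, end2 + 1, ...].
    """
    boundaries = []
    for lo, hi in ranges:
        if boundaries and boundaries[-1] == lo:
            boundaries.pop()  # this range touches the previous one: merge them
        else:
            boundaries.append(lo)
        boundaries.append(hi + 1)
    return boundaries


def in_inversion_list(boundaries, cp):
    # An odd number of boundaries at or below cp means cp lies inside a range.
    return bisect_right(boundaries, cp) % 2 == 1


def bitmap(ranges):
    """Build a fixed-size bitmap with one bit per Unicode code point."""
    bits = bytearray(UNICODE_LIMIT // 8)
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            bits[cp >> 3] |= 1 << (cp & 7)
    return bits


def in_bitmap(bits, cp):
    return bool(bits[cp >> 3] & (1 << (cp & 7)))


# Example set: ASCII digits plus lower-case Latin letters.
ranges = [(ord("0"), ord("9")), (ord("a"), ord("z"))]
inv = inversion_list(ranges)
bmp = bitmap(ranges)
```

In this sketch the inversion list for the example set holds four integers, while the bitmap always occupies `UNICODE_LIMIT // 8` bytes regardless of how few characters the set contains. Conversely, a set that alternates membership at every code point would need roughly one boundary per code point in the inversion list, which is presumably the kind of worst case the abstract refers to.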
