Paper information - Language classification using n-grams accelerated by FPGA-based Bloom filters


N-gram counting (counting n-character sequences in text documents) is a well-established technique for classifying the language of a document's text. In this paper, n-gram processing is accelerated using reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels: parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our end-to-end language classification implementation runs at 85x the speed of comparable software and 1.45x the speed of the competing hardware design.
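The idea in the abstract, testing each n-gram of a document against one Bloom filter per language and picking the language with the most hits, can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the paper's FPGA design: the class and function names, the 4-character n-gram length, the filter size, the number of hash functions, and the use of SHA-256 as the hash are all hypothetical choices made for illustration.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Bit-array set with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits=1 << 14, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(text, n=4):
    """Return all overlapping n-character sequences of the normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(training_text, n=4, top_k=500):
    """Store the top_k most frequent n-grams of a language in a Bloom filter."""
    counts = Counter(ngrams(training_text, n))
    bloom = BloomFilter()
    for gram, _ in counts.most_common(top_k):
        bloom.add(gram)
    return bloom


def classify(document, profiles, n=4):
    """Count per-language filter hits and return (best_language, scores)."""
    grams = ngrams(document, n)
    scores = {lang: sum(g in bloom for g in grams)
              for lang, bloom in profiles.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Tiny made-up training texts, used here only to exercise the sketch.
    profiles = {
        "en": build_profile("the quick brown fox jumps over the lazy dog "
                            "and the cat sat on the mat"),
        "de": build_profile("der schnelle braune fuchs springt über den "
                            "faulen hund und die katze sitzt auf der matte"),
    }
    print(classify("the lazy dog sat on the mat", profiles))
```

In this sketch the per-language lookups are independent of one another, which is the property that a hardware design can exploit by running the filters and classifiers in parallel; the sequential loop here is just the simplest way to show the scoring.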

Maya Gokhale | Arpith C. Jacob

[1] John W. Lockwood, et al. HAIL: a hardware-accelerated algorithm for language identification, 2005, International Conference on Field Programmable Logic and Applications.

[2] John W. Lockwood, et al. Deep packet inspection using parallel bloom filters, 2004, IEEE Micro.

[3] W. B. Cavnar, et al. N-gram-based text categorization, 1994.

[4] Joseph M. Lancaster, et al. Biosequence similarity search on the Mercury system, 2004, Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors.

[5] M. V. Ramakrishna, et al. Efficient Hardware Hashing Functions for High Performance Computers, 1997, IEEE Trans. Computers.

[6] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[7] Tomaz Erjavec, et al. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages, 2006, LREC.

[8] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.