Paper information - Design and implementation of text filtering with no semantic accidental injury

Design and implementation of text filtering with no semantic accidental injury

Information filtering on the Internet refers to finding and filtering out bad words in large-scale web text. Accuracy and efficiency are the main concerns. This paper focuses on filtering mixed Chinese and English text. It proposes a Chinese and English text filtering algorithm, the No Semantic Accidental Injury Filter (NSAIF) algorithm, which is designed to avoid semantic accidental injury. NSAIF is based on the Aho-Corasick (AC) algorithm but avoids space expansion by using dynamic memory allocation. It applies to both Chinese and English text by using one-byte storage. It uses the longest-match principle to find the words that should be filtered in a trie augmented with failure pointers. It shows good time and space performance on test data sets of different sizes and has high theoretical and practical value.
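The ideas named in the abstract (a trie with failure pointers, byte-level storage, children allocated on demand, and a preference for the longest match) can be illustrated with a minimal sketch. This is not the paper's NSAIF implementation; it is a generic Aho-Corasick filter written under several assumptions: text and patterns are encoded as UTF-8 (the paper's encoding is not stated here), each node stores its children in a dictionary keyed by byte value instead of a fixed 256-entry table, and "longest match" is read as keeping the longest pattern that ends at each position. The names `Node`, `build`, and `filter_text` are hypothetical.

```python
from collections import deque


class Node:
    """Trie node keyed by single bytes; children are allocated on demand."""
    __slots__ = ("children", "fail", "length")

    def __init__(self):
        self.children = {}  # byte value -> Node, created only when needed
        self.fail = None    # failure pointer: longest proper suffix that is a trie prefix
        self.length = 0     # byte length of the longest pattern ending here (0 = none)


def build(patterns):
    """Build a byte-level Aho-Corasick automaton from a list of str patterns."""
    root = Node()
    for word in patterns:
        data = word.encode("utf-8")  # assumption: UTF-8 byte encoding
        node = root
        for b in data:
            node = node.children.setdefault(b, Node())
        node.length = len(data)

    # Breadth-first pass to set failure pointers, shallow nodes first.
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for b, child in node.children.items():
            f = node.fail
            while f is not root and b not in f.children:
                f = f.fail
            child.fail = f.children.get(b, root)
            # Keep the longest pattern that ends at this node, own or inherited.
            child.length = max(child.length, child.fail.length)
            queue.append(child)
    return root


def filter_text(root, text, mask="*"):
    """Replace every character covered by a matched pattern with `mask`."""
    data = text.encode("utf-8")
    hit = bytearray(len(data))  # 1 where a byte belongs to a matched pattern
    node = root
    for i, b in enumerate(data):
        while node is not root and b not in node.children:
            node = node.fail
        node = node.children.get(b, root)
        if node.length:
            start = i - node.length + 1
            hit[start:i + 1] = b"\x01" * node.length

    # Map the byte-level marks back to whole characters.
    out = []
    pos = 0
    for ch in text:
        out.append(mask if hit[pos] else ch)
        pos += len(ch.encode("utf-8"))
    return "".join(out)


automaton = build(["bad", "badword", "坏话"])
print(filter_text(automaton, "a badword 和坏话 ok"))  # a ******* 和** ok
```

Because matching runs over bytes, the same automaton handles one-byte English letters and multi-byte Chinese characters without a separate code path. With UTF-8, a valid pattern can only match on character boundaries; a double-byte encoding would need an extra boundary check, which this sketch does not attempt.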

Jia Liu | Fangchun Yang | Danfeng Yan

[1] Udi Manber, et al. A fast algorithm for multi-pattern searching, 1999.

[2] Alfred V. Aho, et al. Efficient string matching, 1975, Commun. ACM.

[3] Deng Gui-shi. A New Multi-pattern Matching Algorithm of Intrusion Detection, 2006.

[4] Edward A. Fox, et al. Practical minimal perfect hash functions for large databases, 1992, Commun. ACM.

[5] Robert S. Boyer, et al. A fast string searching algorithm, 1977, Commun. ACM.

[6] Keh-Yih Su, et al. An Efficient Algorithm for Matching Multiple Patterns, 1993, IEEE Trans. Knowl. Data Eng.

[7] Daniel Sunday, et al. A very fast substring search algorithm, 1990, Commun. ACM.

[8] G. Navarro, et al. Flexible Pattern Matching in Strings: Approximate matching, 2002.

[9] Cyril Allauzen, et al. Simple Optimal String Matching Algorithm, 2000, J. Algorithms.

[10] Shen Zhou. A Fast Multiple Pattern Algorithm for Chinese String Matching, 2001.

[11] Timothy Sherwood, et al. A high throughput string matching architecture for intrusion detection and prevention, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).