Real-time data pre-processing technique for efficient feature extraction in large scale datasets

As domain-specific data sources continue to grow rapidly, there is a sustained need for fast processing in time-sensitive applications, such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery, such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach that improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show that our approach yields significant performance improvements.
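For orientation, the following is a minimal sketch of block-based shifting in the style of the Wu-Manber baseline named above. It is not the Block Suffix Shifting approach itself, whose details are not given here. The function names (`build_tables`, `find_all`), the block size of 2, and the sample terms are assumptions chosen for illustration only.

```python
def build_tables(patterns, block=2):
    """Precompute a shift table and a candidate table (Wu-Manber style).

    All names and the default block size are illustrative assumptions.
    """
    m = min(len(p) for p in patterns)      # only the first m chars drive shifts
    if m < block:
        raise ValueError("every pattern must be at least `block` characters")
    default_shift = m - block + 1          # shift for blocks seen in no pattern
    shift = {}                             # block -> safe shift distance
    bucket = {}                            # block ending the m-prefix -> patterns
    for p in patterns:
        for end in range(block, m + 1):
            blk = p[end - block:end]
            dist = m - end                 # distance from block to prefix end
            if dist < shift.get(blk, default_shift):
                shift[blk] = dist
        bucket.setdefault(p[m - block:m], []).append(p)
    return m, default_shift, shift, bucket


def find_all(text, patterns, block=2):
    """Return (start_index, pattern) pairs for every occurrence in text."""
    m, default_shift, shift, bucket = build_tables(patterns, block)
    hits = []
    i = m - 1                              # index of the last char in the window
    while i < len(text):
        blk = text[i - block + 1:i + 1]
        s = shift.get(blk, default_shift)
        if s > 0:
            i += s                         # no pattern can end its prefix here
            continue
        start = i - m + 1
        for p in bucket.get(blk, ()):      # verify each candidate in full
            if text.startswith(p, start):
                hits.append((start, p))
        i += 1
    return hits


if __name__ == "__main__":
    note = "patient has fever and cough, feverish"
    print(find_all(note, ["fever", "cough", "feverish"]))
```

The scan reads the last block of the current window and skips ahead whenever that block cannot end a pattern prefix, so most text positions are never compared against any pattern. Only windows with a zero shift trigger full verification.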
