Paper Information - Robotomata: A framework for approximate pattern matching of big data on an automata processor

Robotomata: A framework for approximate pattern matching of big data on an automata processor

Approximate pattern matching (APM) is widely used in big data applications such as genome data analysis, speech recognition, fraud detection, and computer vision. Although an automata-based approach is an efficient way to realize APM, the inherent sequentiality of automata deters its implementation on general-purpose parallel platforms, e.g., multicore CPUs and many-core GPUs. Recently, however, Micron has proposed its Automata Processor (AP), a processing-in-memory (PIM) architecture dedicated to non-deterministic finite automata (NFA) simulation. It has nominally achieved a thousands-fold speedup over a multicore CPU for many big data applications. However, the AP ecosystem suffers from two major problems. First, the current APIs of AP require manual manipulation of all computational elements. Second, multiple rounds of time-consuming compilation are needed for large datasets. Both problems hinder programmer productivity and end-to-end performance. Therefore, we propose a paradigm-based approach to hierarchically generate automata on AP and use this approach to create Robotomata, a framework for APM on AP. Given three inputs (the type of APM paradigm, the desired pattern length, and the allowed number of errors), our framework generates optimized APM-automata codes on AP, improving programmer productivity. The generated codes also maximize the reuse of pre-compiled macros and significantly reduce the time for reconfiguration. We evaluate Robotomata by comparing it to two state-of-the-art APM implementations on AP with real-world datasets. Our experimental results show that our generated codes achieve speedups of up to 30.5x and 12.8x in configuration while maintaining computational performance. Compared to the counterparts on CPU, our codes achieve up to 393x overall speedup, even when the reconfiguration costs are included. We highlight the importance of counting the configuration time towards the overall performance on AP, which provides better insight into identifying essential hardware features, specifically for large-scale problem sizes.
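To make the automata-based view of APM concrete, the following is a minimal software sketch of NFA simulation for approximate matching. It assumes Levenshtein (edit) distance as the APM paradigm and takes a pattern and an allowed number of errors `k`. It is not the code Robotomata generates and does not use any AP API; the function names `closure`, `step`, and `find_matches` are hypothetical and chosen only for illustration. Each NFA state is a pair `(i, e)`: `i` pattern characters consumed with `e` errors so far.

```python
def closure(states, m, k):
    """Add states reachable without consuming input (a pattern character is skipped)."""
    result = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k:
            nxt = (i + 1, e + 1)  # deletion: skip pattern[i] at the cost of one error
            if nxt not in result:
                result.add(nxt)
                stack.append(nxt)
    return result


def step(states, ch, pattern, k):
    """Advance every active state on one input character."""
    m = len(pattern)
    nxt = set()
    for i, e in states:
        if i < m and pattern[i] == ch:
            nxt.add((i + 1, e))          # exact match
        if e < k:
            nxt.add((i, e + 1))          # insertion: extra character in the text
            if i < m:
                nxt.add((i + 1, e + 1))  # substitution
    return closure(nxt, m, k)


def find_matches(pattern, text, k):
    """Return text positions where a match with at most k errors ends."""
    m = len(pattern)
    start = closure({(0, 0)}, m, k)
    active = set(start)
    ends = []
    for pos, ch in enumerate(text):
        # The start state stays active so a match may begin at any position.
        active = step(active, ch, pattern, k) | start
        if any(i == m for i, _ in active):
            ends.append(pos)
    return ends
```

For example, `find_matches("abc", "xxabcxx", 0)` reports the single end position 4. On a CPU the active states are visited one by one for every input character, which is the sequential cost the abstract refers to; on an NFA engine, all active states advance on each input symbol in parallel.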
