Entity resolution acceleration using the automata processor

Entity Resolution (ER), the process of finding identical entities across different databases, is critical to many information-integration applications. As database sizes explode in the big-data era, it becomes computationally expensive to recognize identical entities among all records across multiple databases when variations between records are allowed. Profiling results show that approximate (fuzzy) matching is the primary bottleneck. The Automata Processor (AP), an efficient and scalable semiconductor architecture for parallel automata processing, provides a new opportunity for hardware acceleration of ER. We propose an AP-accelerated ER solution that speeds up this bottleneck, the fuzzy matching of similar but potentially inexactly matched names, and we use several real-world applications to illustrate its effectiveness. Compared with several conventional methods, the proposed method achieved both promising speedups and better accuracy (more correct pairs and lower generalized merge distance cost) on different datasets.
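To make the fuzzy-matching step concrete, the following is a minimal software sketch of a Levenshtein automaton simulated as a set of active NFA states, which is the kind of computation an automata processor can evaluate in parallel. It is not the authors' AP design: the function name `within_edit_distance`, the threshold `k`, and the sample names are assumptions chosen for illustration only.

```python
def within_edit_distance(pattern: str, text: str, k: int) -> bool:
    """Return True if the edit distance between pattern and text is at most k.

    Hypothetical helper: simulates a Levenshtein NFA whose states are pairs
    (i, e), where i is the number of pattern characters consumed and e is
    the number of edits used so far.
    """
    n = len(pattern)

    def closure(states):
        # Epsilon moves: skip one pattern character (a deletion) for one edit.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < n and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    for ch in text:
        nxt = set()
        for i, e in active:
            if i < n and pattern[i] == ch:
                nxt.add((i + 1, e))          # exact match, no extra edit
            if e < k:
                nxt.add((i, e + 1))          # insertion: extra character in text
                if i < n:
                    nxt.add((i + 1, e + 1))  # substitution
        active = closure(nxt)
        if not active:
            return False                     # no state can still accept
    # Accept if any active state has consumed the whole pattern.
    return any(i == n for i, _ in active)


# Usage: compare candidate name pairs under a small edit budget.
print(within_edit_distance("Jonathan Smith", "Jonathon Smith", 1))
print(within_edit_distance("Jon Smith", "John Smith", 1))
print(within_edit_distance("Smith", "Smyth", 0))
```

In software the active-state set is updated one state at a time for every input character; the appeal of an automata-based accelerator is that all active states can advance simultaneously on each input symbol.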
