Hardware accelerated algorithms for semantic processing of document streams

Intelligence communities need to analyze massive streams of multilingual, unstructured data. Mathematical transformation algorithms have proven effective at interpreting such data, but their high computational requirements prevent widespread use. Field programmable gate array (FPGA) hardware can greatly increase the rate of computation. To experiment with this approach, we developed an FPGA-based system that ingests content over a network at high data rates. The system extracts basewords, counts words, scores documents, and discovers concepts in data carried in TCP/IP network flows, either as packets over a Gigabit Ethernet link or as cells transported over an OC48 link. Implementing these algorithms in FPGA hardware constrains the complexity and richness of the semantic processing they can perform. To understand the implications of these constraints and to benchmark the performance of the system, we performed a series of experiments processing multilingual documents. In these experiments, we compare techniques for generating basewords for our semantic concepts, scoring documents, and discovering concepts across a variety of operational scenarios.
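
The baseword extraction, word counting, and document scoring steps named above can be illustrated in software. The following is a minimal sketch only, not the FPGA implementation: the baseword table, the concept weight vectors, and the function names are all hypothetical, and scoring is assumed here to be a simple weighted sum of baseword counts. Concept discovery is not covered.

```python
from collections import Counter

# Hypothetical word-to-baseword table (assumption; not taken from the paper).
BASEWORDS = {
    "attack": "attack", "attacks": "attack", "attacked": "attack",
    "bomb": "bomb", "bombs": "bomb",
    "market": "market", "markets": "market",
}

# Hypothetical concept vectors: one weight per baseword (assumption).
CONCEPTS = {
    "conflict": {"attack": 1.0, "bomb": 1.0},
    "finance": {"market": 1.0},
}


def count_basewords(text):
    """Map each token to its baseword, if any, and count occurrences."""
    counts = Counter()
    for token in text.lower().split():
        base = BASEWORDS.get(token.strip(".,;:!?"))
        if base is not None:
            counts[base] += 1
    return counts


def score_document(text):
    """Score a document against each concept as a weighted sum of counts."""
    counts = count_basewords(text)
    return {
        name: sum(weight * counts[base] for base, weight in weights.items())
        for name, weights in CONCEPTS.items()
    }


def best_concept(text):
    """Return the name of the highest-scoring concept."""
    scores = score_document(text)
    return max(scores, key=scores.get)
```

A fixed lookup table and fixed-size count vector of this kind suggest the sort of bounded structure that hardware constraints tend to favor, though the actual design choices are described in the paper itself.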
