Parallel Information Extraction on Shared Memory Multi-processor System

Text mining is one of the most promising answers to the information explosion of today and of the future. As processor technology advances, it is expected to become a mass-market desktop application in the many-core era. In a text mining system, information extraction is a representative module and the most compute-intensive part. In this paper, we study the performance of parallel information extraction on shared memory multi-processor systems in order to gain insight into how such applications will behave on future many-core architectures. For the implementation, the conditional random fields (CRFs) algorithm is selected as the core of the information extraction module. Starting from the latest CRFs toolkit, FlexCRFs, we apply several serial optimizations and then parallelize it with MPI and System V IPC/shm. We also conduct a detailed performance analysis of this parallel application on the target system.