Parallel Text Mining for Large Text Processing

There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle the exponential growth in textual data. Problem sizes increase daily as new text documents are added. The aim of this work is therefore to lay the foundations for mining large text datasets (i.e. full-text articles) in reasonable timeframes. Labelling sequence data, as in part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition, is one of the most important tasks in text mining. This work focuses on two state-of-the-art tools, the GENIA tagger and the STEPP parser. GENIA is a POS tagger specifically tuned for biomedical text, and STEPP is a full parser. Parallel versions of GENIA and STEPP have been developed, and their performance has been compared on a number of different architectures. The focus has been on scalability in particular: scaling to 512 processors has been achieved. Furthermore, a parallel text mining framework is proposed that enables scaling to 10,000 processors for massively parallel text mining applications. Processing times for the given datasets have been reduced dramatically, from over 70 days to hours (approaching a reduction of three orders of magnitude). The parallel implementation uses the Message Passing Interface (MPI) to keep the code portable. The tests used the entire collection of Medline abstracts together with 125,000 full-text articles.
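As a rough illustration of the data-parallel idea behind such an implementation, the sketch below splits a document collection into near-equal contiguous blocks, one per process. It is a minimal sketch under stated assumptions, not the method used in this work: the names `partition`, `tag_document` and `run_rank` are hypothetical, the tagger is a whitespace-splitting placeholder rather than GENIA or STEPP, and the rank and process count are passed in as plain integers where an MPI program would obtain them from `MPI_Comm_rank` and `MPI_Comm_size`.

```python
def partition(num_docs, num_ranks, rank):
    """Return the half-open index range [start, stop) assigned to `rank`.

    Documents are split into contiguous blocks whose sizes differ by at
    most one, so every process gets a near-equal share of the collection.
    """
    base, extra = divmod(num_docs, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def tag_document(text):
    """Hypothetical stand-in for a real tagger: label each token 'TOKEN'."""
    return [(token, "TOKEN") for token in text.split()]


def run_rank(docs, num_ranks, rank):
    """Process only the documents that belong to this rank."""
    start, stop = partition(len(docs), num_ranks, rank)
    return [tag_document(doc) for doc in docs[start:stop]]


if __name__ == "__main__":
    docs = ["doc %d text" % i for i in range(10)]
    # Simulate 4 processes sequentially; under MPI each rank would run once.
    for rank in range(4):
        print(rank, partition(len(docs), 4, rank), len(run_rank(docs, 4, rank)))
```

Because each document is processed independently, a static split like this needs no communication between processes during tagging. It assumes documents cost roughly the same to process; collections with widely varying document lengths may call for a dynamic work distribution instead.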
