A Hadoop-based platform for natural language processing of web pages and documents

Abstract: The rapid and extensive spread of information across the web has produced a huge amount of unstructured natural language text. Over the last decade, great interest has arisen in discovering, accessing and sharing this vast source of knowledge. For this reason, processing very large data volumes in a reasonable time frame is becoming a major challenge and a crucial requirement in many commercial and research fields. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in recent years, since they significantly improve computing performance in data-intensive contexts such as Big Data mining and analysis. Natural Language Processing (NLP), and particularly the tasks of text annotation and key feature extraction, is an application area with high computational requirements; these tasks can therefore benefit significantly from parallel architectures. This paper presents a distributed framework for crawling web documents and running NLP tasks in parallel. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, MapReduce. Specifically, we implemented a MapReduce adaptation of a GATE application and framework (GATE is a widely used open source tool for text engineering and NLP). The solution is validated by using it to extract keywords and keyphrases from web documents on a multi-node Hadoop cluster. Performance scalability has been evaluated against a real corpus of web pages and documents.
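As an illustration of the MapReduce paradigm mentioned above, the following minimal sketch simulates the map, shuffle and reduce phases in plain Python for a toy term-counting job over a small set of documents. It is not the GATE-based pipeline described in the paper and it does not use the Hadoop API: the function names (`map_document`, `shuffle`, `reduce_term`, `run_job`), the stopword list and the two-document corpus are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "for", "to"}


def map_document(doc_id, text):
    """Map phase: emit a (term, 1) pair for every non-stopword token of one document."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOPWORDS:
            yield token, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_term(term, counts):
    """Reduce phase: sum the partial counts collected for one term."""
    return term, sum(counts)


def run_job(corpus):
    """Run the three phases sequentially over a {doc_id: text} dictionary."""
    pairs = [
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    ]
    return dict(reduce_term(term, counts) for term, counts in shuffle(pairs).items())


# Toy corpus standing in for crawled web pages (invented for this example).
corpus = {
    "page1": "Hadoop is a framework for distributed processing",
    "page2": "Distributed processing of text in Hadoop",
}
counts = run_job(corpus)
```

In a real cluster each call to the map function would run on a different node against its own split of the corpus, and the framework would perform the grouping step; the sequential version here only shows how the work is decomposed.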
