A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

The rapidly growing World Wide Web and the increasing number of resources available online represent a massive source of knowledge for research and business interests. Most of this knowledge is embedded in the textual content of web pages and documents, largely in the form of unstructured natural language. Single-machine, non-distributed architectures are proving inefficient at automatically ingesting and processing such large amounts of data for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity and their computational demands have increased significantly, calling for solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. The framework integrates the APIs of the widely used open source GATE NLP platform into a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
