Normalized Google Distance for Collocation Extraction from Islamic Domain

This study investigates the properties of Arabic collocations and classifies them according to their structural patterns in the Islamic domain. The patterns and variations of the collocations are identified from linguistic information. A system is then described that extracts collocations from Islamic-domain text using statistical measures. In the candidate-ranking step, the normalized Google distance is adapted to measure the association between the words in the candidate set. For evaluation, an n-best approach is used: the n-best list for each association measure is selected, and all candidates in these lists are annotated manually. Four other association measures (log-likelihood ratio, t-score, mutual information, and enhanced mutual information) are applied in the same ranking step for comparison with the normalized Google distance on Arabic collocation extraction. In the experiment, the normalized Google distance achieved the highest precision, 93%, among the association measures compared. This result supports using the normalized Google distance to measure the relatedness between the constituent words of collocations, in place of the frequency-based association measures used in state-of-the-art methods.

Keywords: normalized Google distance, collocation extraction, Islamic domain
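To make the ranking and evaluation steps concrete, the following minimal sketch computes the normalized Google distance in its standard form, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), and a precision score over an n-best list. It is an illustration under stated assumptions, not the system described here: the abstract does not say how the distance was adapted, so taking f(x), f(y), and f(x, y) as corpus frequencies and N as the corpus size is an assumption, and the function names and example counts are hypothetical.

```python
import math


def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance from raw counts.

    f_x, f_y: frequencies of the two words; f_xy: their co-occurrence
    frequency; n: total number of counting units (assumed here to be
    the corpus size). Lower values mean a stronger association.
    """
    if f_xy == 0:
        # Words that never co-occur are treated as maximally distant.
        return math.inf
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    denominator = math.log(n) - min(log_x, log_y)
    if denominator == 0:
        # Degenerate case: both words occur in every counting unit.
        return 0.0
    return (max(log_x, log_y) - log_xy) / denominator


def rank_candidates(counts, n):
    """Sort candidate word pairs by ascending NGD (best candidates first).

    counts maps a candidate (word1, word2) to a tuple (f_x, f_y, f_xy).
    """
    return sorted(counts, key=lambda pair: ngd(*counts[pair], n))


def precision_at_n(ranked, true_collocations, n):
    """Share of the n best-ranked candidates judged to be true collocations."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in true_collocations) / len(top)


# Hypothetical counts, for illustration only (not data from the study).
example_counts = {
    ("w1", "w2"): (120, 150, 100),
    ("w3", "w4"): (900, 800, 10),
    ("w5", "w6"): (60, 70, 55),
}
ranked = rank_candidates(example_counts, n=100_000)
manually_annotated = {("w1", "w2"), ("w5", "w6")}
print(ranked)
print(precision_at_n(ranked, manually_annotated, n=2))
```

Because NGD is a distance, candidates are sorted in ascending order. The other association measures named above are scores, where higher means stronger, so they would be sorted in descending order before the same n-best precision is computed.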
