Automatic Building an Extensive Arabic FA Terms Dictionary

Field Association (FA) terms are a limited set of discriminating terms that identify document fields; they are effective in document classification, similar-file retrieval, and passage retrieval. However, there is no effective method for automatically extracting relevant Arabic FA terms to build a comprehensive dictionary. Moreover, all previous studies address FA terms in English and Japanese, and extending FA terms to other languages such as Arabic could strengthen further research. This paper presents a new method for extracting Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. An experimental evaluation was carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news; the method selected an average of 2,825 FA terms (single and compound) per field. In the experiments, recall and precision were 84% and 79%, respectively. The method therefore selects a large number of relevant Arabic FA terms with high precision and recall.

Keywords: Arabic Field Association Terms, information extraction, document classification, information retrieval.
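The abstract names two ingredients, POS pattern rules and corpora comparison, without giving their details. The following Python sketch illustrates the general idea only and is not the authors' implementation. The tag names, the three patterns, the smoothed frequency ratio, and the `min_ratio` threshold are all assumptions chosen for illustration, and the input is assumed to be already POS-tagged as `(word, tag)` pairs.

```python
from collections import Counter

# Hypothetical POS patterns for single and compound candidates.
# The tag names and the patterns themselves are illustrative assumptions.
PATTERNS = {("NOUN",), ("NOUN", "NOUN"), ("NOUN", "ADJ")}


def candidates(tagged):
    """Yield word n-grams whose POS tag sequence matches one of PATTERNS."""
    for n in {len(p) for p in PATTERNS}:
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(tag for _, tag in window) in PATTERNS:
                yield " ".join(word for word, _ in window)


def select_fa_terms(field_tagged, reference_tagged, min_ratio=2.0):
    """Keep candidates that are relatively more frequent in the field corpus.

    The score is the candidate's relative frequency in the field corpus
    divided by its add-one-smoothed relative frequency in the reference
    corpus (an assumed scoring rule, not taken from the paper).
    """
    field = Counter(candidates(field_tagged))
    ref = Counter(candidates(reference_tagged))
    field_total = sum(field.values()) or 1
    ref_total = sum(ref.values())
    selected = {}
    for term, count in field.items():
        field_rate = count / field_total
        ref_rate = (ref[term] + 1) / (ref_total + 1)
        ratio = field_rate / ref_rate
        if ratio >= min_ratio:
            selected[term] = ratio
    return selected


def precision_recall(selected, relevant):
    """Precision and recall of the selected terms against a gold set."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

In this sketch, `select_fa_terms` would be called once per field, with the other fields (or a general corpus) serving as the reference, and `precision_recall` would compare the result with a manually judged term list for that field.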
