Automatic Arabic term extraction from special domain corpora

The availability of machine-readable Arabic special domain text in digital libraries, websites of Arabic university publications, and refereed journals makes a range of studies and applications possible. One such application is automatic term extraction from special domain corpora. The extracted terms can serve as a foundation for other applications and research, such as special domain dictionary building, terminology resource creation, and special domain ontology construction. Our literature survey shows a lack of such studies for Arabic special domain text; moreover, the few studies we identified use complex and computationally expensive methods. In this study, we use two basic methods to automatically extract terms from Arabic special domain corpora. The methods rest on two simple heuristics: the most frequent words and n-grams in a special domain corpus are typically terms, and terms are typically bounded by function words. We applied our methods to a corpus of applied Arabic linguistics. Our results are comparable to those of other Arabic term extraction studies: accuracy was 87% when only terms strictly pertaining to the field of applied Arabic linguistics were counted, and 93.7% when related terms were included.
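
The two heuristics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the tiny `FUNCTION_WORDS` list, the function names, and the `max_n` and `min_freq` parameters are all assumptions made for illustration. The sketch cuts the token stream at function words, counts every n-gram inside the resulting chunks, and keeps the frequent ones as term candidates.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption);
# a real system would need a far more complete one.
FUNCTION_WORDS = {"في", "من", "على", "إلى", "عن", "و", "أن", "هذا", "هذه", "التي", "الذي"}


def split_on_function_words(tokens, function_words=FUNCTION_WORDS):
    """Yield maximal runs of tokens that contain no function word."""
    chunk = []
    for tok in tokens:
        if tok in function_words:
            if chunk:
                yield chunk
            chunk = []
        else:
            chunk.append(tok)
    if chunk:
        yield chunk


def extract_candidates(text, max_n=3, min_freq=2, function_words=FUNCTION_WORDS):
    """Return (candidate, frequency) pairs, most frequent first.

    Candidates are n-grams (n = 1..max_n) that lie entirely between
    function words and occur at least min_freq times.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    counts = Counter()
    for chunk in split_on_function_words(tokens, function_words):
        for n in range(1, max_n + 1):
            for i in range(len(chunk) - n + 1):
                counts[" ".join(chunk[i:i + n])] += 1
    return [(term, freq) for term, freq in counts.most_common() if freq >= min_freq]


if __name__ == "__main__":
    sample = "تعليم اللغة في المدرسة و تعليم اللغة من الكتاب"
    for term, freq in extract_candidates(sample):
        print(freq, term)
```

Counting only inside function-word boundaries keeps n-grams such as a preposition plus a noun out of the candidate list, which is the point of the second heuristic. The frequency threshold stands in for the first; how the study itself sets thresholds or ranks candidates is not stated in the abstract.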
