N-Gram-Based Techniques for Arabic Text Document Matching; Case Study: Courses Accreditation

Measuring text similarity has long been studied because of its importance in many applications in natural language processing and related areas, such as Web-based document searching. One such application, investigated in this paper, is determining the similarity between course descriptions of the same subject for credit transfer among universities or similar academic programs. Three different bi-gram techniques are used to calculate the similarity between two or more Arabic documents that take the form of course descriptions. The first technique uses the vector model to represent each document, associating each bi-gram with a weight that reflects its importance in the document; cosine similarity is then used to compute the similarity between the two vectors. The other two are word-based and whole-document-based evaluation techniques; in both, Dice’s similarity measure is applied to calculate the similarity between any given pair of documents. The results indicate that the first technique performs better than the other two when its output is compared with human judgment.
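The two similarity measures named above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: bi-grams are taken at the character level within each whitespace-separated word, the vector weights are raw bi-gram frequencies, and Dice's measure is computed over sets of unique bi-grams for the whole document. The function names (`char_bigrams`, `dice_similarity`, `cosine_similarity`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text):
    """Return the character bi-grams of every whitespace-separated word.

    Assumption: bi-grams are character-level and do not cross word boundaries.
    """
    grams = []
    for word in text.split():
        grams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return grams


def dice_similarity(doc_a, doc_b):
    """Dice's coefficient 2|A ∩ B| / (|A| + |B|) over sets of unique bi-grams."""
    a = set(char_bigrams(doc_a))
    b = set(char_bigrams(doc_b))
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bi-gram vectors.

    Assumption: each bi-gram's weight is its raw frequency in the document;
    the paper's actual weighting scheme may differ.
    """
    va = Counter(char_bigrams(doc_a))
    vb = Counter(char_bigrams(doc_b))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = sqrt(sum(w * w for w in va.values()))
    norm_b = sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Illustrative Arabic course titles (made up for this sketch).
    first = "قواعد البيانات"
    second = "نظم قواعد البيانات"
    print(dice_similarity(first, second))
    print(cosine_similarity(first, second))
```

Both functions return a score between 0 and 1, so the scores for a pair of course descriptions can be compared directly with a human rating on the same scale. A different weighting scheme, such as one that discounts bi-grams common to many documents, could replace the raw counts in `cosine_similarity` without changing the rest of the sketch.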