K-means, HAC and FCM: Which Clustering Approach for Arabic Text?

Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed for textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will contain. To overcome this constraint, clustering approaches such as K-means, hierarchical agglomerative clustering (HAC), or fuzzy C-means (FCM) can be exploited. In this paper, we present and compare these approaches, and we show the importance of exploiting clustering algorithms to classify and analyze textual data in Arabic by applying them to a real case that has sparked a great debate in Morocco: that of teachers contracting with academies.
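To make the contrast between hard and fuzzy clustering concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes documents have already been turned into numeric vectors (toy 2-D points stand in for them here), that cluster centroids are given, and that the FCM fuzzifier is m = 2. The function names `hard_assign` and `fuzzy_memberships` are hypothetical.

```python
import math

def hard_assign(point, centroids):
    """K-means style: return the index of the single nearest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists))

def fuzzy_memberships(point, centroids, m=2.0):
    """FCM style: return a membership degree per cluster (degrees sum to 1).

    m is the fuzzifier (assumed value 2.0); it must be greater than 1.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs only to those.
    zeros = dists.count(0.0)
    if zeros:
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

if __name__ == "__main__":
    # Toy centroids and point, chosen only for illustration.
    centroids = [(0.0, 0.0), (4.0, 0.0)]
    point = (1.0, 0.0)
    print(hard_assign(point, centroids))        # index of the nearest cluster
    print(fuzzy_memberships(point, centroids))  # one degree per cluster
```

With K-means each text ends up in exactly one cluster, whereas FCM lets a text that mixes several opinions belong partially to several clusters. HAC, which builds a hierarchy by repeatedly merging the closest clusters, is not covered by this sketch.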
