An Evolutionary-Based Term Reduction Approach to Bilingual Clustering of Malay-English Corpora

Document clustering groups unstructured text documents into a predefined set of clusters in order to give users more information. Many studies have addressed the clustering of monolingual documents, and advances in current technology have made bilingual clustering feasible as well. However, bilingual document clustering still faces the same problem as monolingual document clustering, namely the “curse of dimensionality”. This motivates the study of term reduction techniques for clustering bilingual documents. The objective of this study is to examine the effects of reducing the number of terms considered when clustering a bilingual corpus of English and Malay documents in parallel. A genetic algorithm (GA) is used to reduce the number of features selected, with single-point crossover and a crossover rate of 0.8. The study also assesses the effects of applying different mutation rates (e.g., 0.1 and 0.01) on the number of features selected for clustering bilingual documents. The results show that applying the GA improves the clustering mapping compared with the initial clustering mapping, and that the GA with a mutation rate of 0.01 produces better parallel clustering mapping results than the GA with a mutation rate of 0.1.
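The GA-based term reduction described above can be illustrated with a minimal sketch. Only the single-point crossover, the crossover rate of 0.8, and the mutation rates of 0.1 and 0.01 come from the abstract. The binary term mask encoding, the binary tournament selection, the elitism, the population size, the number of generations, and the `toy_fitness` function are all assumptions made for illustration; the fitness measure actually used in the study is not stated here.

```python
import random


def single_point_crossover(a, b, rng, rate=0.8):
    """With probability `rate`, swap the tails of two parent masks at one cut point."""
    if len(a) < 2 or rng.random() >= rate:
        return a[:], b[:]
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(mask, rng, rate=0.01):
    """Flip each bit of the term mask independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def evolve(fitness, n_terms, rng, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.01):
    """Evolve a binary mask over terms (1 = keep the term, 0 = drop it).

    Selection scheme, elitism, pop_size and generations are assumptions.
    """
    pop = [[rng.randint(0, 1) for _ in range(n_terms)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick():
        # Binary tournament selection (assumed, not from the source).
        x, y = rng.sample(pop, 2)
        return x if fitness(x) >= fitness(y) else y

    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            c1, c2 = single_point_crossover(pick(), pick(), rng, crossover_rate)
            nxt.append(mutate(c1, rng, mutation_rate))
            if len(nxt) < pop_size:
                nxt.append(mutate(c2, rng, mutation_rate))
        pop = nxt
        best = max(pop, key=fitness)
    return best


# Hypothetical fitness: reward keeping a few "informative" term indices and
# penalise the total number of terms kept. A real fitness would score the
# clustering obtained with the selected terms.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


if __name__ == "__main__":
    for rate in (0.1, 0.01):
        mask = evolve(toy_fitness, n_terms=10, rng=random.Random(0),
                      mutation_rate=rate)
        print(rate, mask, round(toy_fitness(mask), 2))
```

Running the script compares the two mutation rates on the toy problem only; it says nothing about which rate works better on the Malay-English corpus.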
