Towards a Bio-inspired Approach to Match Heterogeneous Documents

Matching heterogeneous text documents from different sources means matching the data extracted from them, which is generally structured as vectors. Matching accuracy depends directly on what these vectors contain, so selecting the best features is essential. In this paper, we present a new approach that uses a quantum-inspired genetic algorithm to select the minimum set of features representing the semantics of a set of text documents. Among the different Vs that characterize big data, we focus on the 'Variety' criterion: we used three semantically similar sets of documents drawn from different sources and retrieved the best features describing the semantics of the corpus. In the matching phase, our approach shows a significant improvement over the classic 'bag-of-words' approach. (Author's abstract)
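To illustrate why the content of the vectors matters for matching, here is a minimal sketch that compares two documents by cosine similarity, first over the full bag-of-words vocabulary and then over a reduced feature set. The sample sentences, the helper names `vectorize` and `cosine`, and the hand-picked `selected` list are all assumptions made for illustration; in the paper the feature subset is chosen by the quantum-inspired genetic algorithm, which is not reproduced here.

```python
import math
from collections import Counter


def vectorize(text, features):
    """Count how often each feature (word) occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[f] for f in features]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two hypothetical documents from different sources with similar meaning.
doc_a = "the quantum algorithm selects the best features"
doc_b = "a quantum algorithm picks features of the corpus"

# Classic bag-of-words: every distinct word is a feature.
all_words = sorted(set(doc_a.split()) | set(doc_b.split()))

# Hypothetical reduced feature set, hand-picked here for illustration only.
selected = ["quantum", "algorithm", "features"]

sim_bow = cosine(vectorize(doc_a, all_words), vectorize(doc_b, all_words))
sim_sel = cosine(vectorize(doc_a, selected), vectorize(doc_b, selected))

print(f"bag-of-words similarity:     {sim_bow:.3f}")
print(f"selected-feature similarity: {sim_sel:.3f}")
```

In this toy case the reduced vectors score the two documents as a closer match than the full vocabulary does, because words that carry little shared meaning no longer dilute the comparison. It says nothing about the size of the improvement reported in the paper.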
