Using genetic algorithms in word-vector optimisation

Word vectors and sets of words are used in a wide range of text-based applications. Yet these word sets are often chosen on an ad hoc basis. In this study, we examine two text-based applications that use word sets and in both cases find that classification performance can be optimised using a fairly simple genetic algorithm. The first study is in authorship attribution, the second one is sentiment analysis and in both cases classification precision can be improved using a genetic algorithm. In authorship attribution, in recent years the trend has been towards ever larger word vectors [1,2]. We suggest that this might be a counter-productive step as it can easily lead to inaccuracy caused by overfitting or vector-space sparsity (the curse of dimensionality). In sentiment analysis precision is the main issue as rates of greater than 80–85% are not easy to achieve.