Using genetic algorithms in word-vector optimisation

Word vectors and sets of words are used in a wide range of text-based applications, yet these word sets are often chosen on an ad hoc basis. In this study, we examine two text-based applications that use word sets, authorship attribution and sentiment analysis, and find in both cases that classification precision can be improved using a fairly simple genetic algorithm. In authorship attribution, the trend in recent years has been towards ever larger word vectors [1,2]. We suggest that this might be a counter-productive step, as it can easily lead to inaccuracy caused by overfitting or vector-space sparsity (the curse of dimensionality). In sentiment analysis, precision is the main issue, as rates greater than 80–85% are not easy to achieve.
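The abstract does not say how the genetic algorithm is set up, so the following is only a minimal sketch of one common way to search for a word set. It assumes a bit-mask encoding (one bit per candidate word), tournament selection, one-point crossover, bit-flip mutation and elitism. The names `evolve_word_set`, `toy_fitness`, `VOCAB` and `USEFUL` are hypothetical, and `toy_fitness` merely stands in for the classification score that a real attribution or sentiment experiment would supply.

```python
import random


def evolve_word_set(vocab, fitness, pop_size=30, generations=40,
                    mutation_rate=0.05, seed=0):
    """Search for a subset of `vocab` that maximises `fitness(subset)`.

    Assumes len(vocab) >= 2 and pop_size >= 3.
    """
    rng = random.Random(seed)
    n = len(vocab)

    def decode(mask):
        # Turn a bit mask into the list of selected words.
        return [w for w, bit in zip(vocab, mask) if bit]

    def score(mask):
        return fitness(decode(mask))

    # Random initial population of bit masks (1 = word is in the set).
    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_gen = [ranked[0][:]]  # elitism: carry the best mask over unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(ranked, 3), key=score)
            p2 = max(rng.sample(ranked, 3), key=score)
            # One-point crossover.
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_gen.append(child)
        population = next_gen

    best = max(population, key=score)
    return decode(best), score(best)


# Hypothetical toy problem: reward "useful" words, penalise the rest,
# so that smaller, better-targeted word sets score higher.
VOCAB = ["the", "of", "and", "to", "upon", "whilst", "by", "while", "on", "there"]
USEFUL = {"upon", "whilst", "while", "there"}


def toy_fitness(words):
    hits = sum(1 for w in words if w in USEFUL)
    return hits - 0.5 * (len(words) - hits)


best_words, best_score = evolve_word_set(VOCAB, toy_fitness)
print(best_words, best_score)
```

In a real experiment the fitness function would train and evaluate a classifier on the selected words, ideally on held-out data, since scoring on the training data alone invites the overfitting that the abstract warns about.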

Peter W.H. Smith | P. Smith

[1] J. M. Kittross. The measurement of meaning, 1959.

[2] Hichem Frigui, et al. Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents, 2004.

[3] John Burrows, et al. 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship, 2002, Lit. Linguistic Comput.

[4] N. L. Johnson, et al. Multivariate Analysis, 1958, Nature.

[5] Andrea Esuli, et al. SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining, 2006, LREC.

[6] Peter W. H. Smith, et al. The Authorship of The American Declaration of Independence, 2008.

[7] David L. Hoover, et al. Delta Prime?, 2004, Lit. Linguistic Comput.

[8] Bo Pang, et al. Thumbs up? Sentiment Classification using Machine Learning Techniques, 2002, EMNLP.

[9] David L. Hoover, et al. Testing Burrows's Delta, 2004, Lit. Linguistic Comput.

[10] Peter W. H. Smith, et al. Improving Authorship Attribution: Optimizing Burrows' Delta Method, 2011, J. Quant. Linguistics.