Efficient Calculation of Maximum Likelihood Estimation for Authorship Attribution Using Lexical and Syntactic Features
This paper compares two models for authorship attribution based on a Bayesian approach. Authorship attribution is the task of determining the actual author of a particular text: when two authors, say A1 and A2, both claim to have written an essay, the real author must be identified. The usual solution is to compute a maximum likelihood estimate (MLE) for each disputed author, that is, to train one probabilistic model for author A1 and another for author A2, and then use the two models to calculate the likelihood of the text. This method is known as the Bayesian approach. It requires an unknown text and a large text sample from each of the two authors. The likelihood can be computed with unigram, bigram, or trigram models. Usually unigrams are chosen: the number of occurrences of each unigram is counted and its probability is calculated, and the author whose model yields the higher probability is taken to be the actual author. This is the method commonly used for authorship attribution. This paper uses a second method that considers only singleton unigrams, that is, words that occur exactly once in the disputed text, or “the unknown text”. The paper concentrates on vocabulary usage as the means of ascertaining the original author. It also proposes a more advanced method that uses additional grammatical features, namely syntactic features. Both the singleton unigram model and the unigram model are used to find the maximum likelihood estimate.
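The unigram comparison described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the whitespace tokenizer, the add-one smoothing, and the function names (`tokenize`, `train_unigram`, `log_likelihood`, `attribute`) are all assumptions made for illustration. The `singletons_only` flag restricts scoring to words that occur exactly once in the unknown text, as in the singleton unigram model.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_unigram(sample):
    # Count unigram occurrences in one author's text sample.
    return Counter(tokenize(sample))


def log_likelihood(words, counts, vocab_size):
    # Log-probability of the words under a unigram model.
    # Add-one smoothing is an assumption of this sketch; it keeps
    # unseen words from producing a zero probability.
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )


def attribute(unknown, sample_a1, sample_a2, singletons_only=False):
    words = tokenize(unknown)
    if singletons_only:
        # Keep only words that occur exactly once in the unknown text.
        freq = Counter(words)
        words = [w for w in words if freq[w] == 1]
    counts_a1 = train_unigram(sample_a1)
    counts_a2 = train_unigram(sample_a2)
    # Shared vocabulary so both models are smoothed identically.
    vocab_size = len(set(counts_a1) | set(counts_a2) | set(words))
    ll_a1 = log_likelihood(words, counts_a1, vocab_size)
    ll_a2 = log_likelihood(words, counts_a2, vocab_size)
    # The author whose model gives the higher likelihood is chosen.
    return ("A1" if ll_a1 >= ll_a2 else "A2"), ll_a1, ll_a2
```

Calling `attribute(unknown, sample_a1, sample_a2)` scores every word of the unknown text, while passing `singletons_only=True` scores only its singleton words. Working in log space avoids numerical underflow when the unknown text is long.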