Extraction of Authors' Characteristics from Japanese Modern Sentences via N-gram Distribution
Most studies of authorship attribution have dealt with text in which the boundaries between words are obvious [1] [2]. Applying these studies to languages whose sentences cannot be divided unambiguously into words, such as Japanese or Chinese, requires preliminary processing of the text, such as morphological analysis, and this processing may influence the final results. Methods that exploit characteristics of a particular language or a particular kind of composition also have limited coverage [3]. Extracting authors' characteristics from sentences is, in general, an unsolved problem. We therefore propose a method for authorship attribution based on the distribution of character n-grams in sentences. The proposed method can analyze sentences without any additional information, i.e., without preliminary analyses. We also report experiments in which the 3-grams that represent an author's characteristics were extracted on the basis of their distributions.
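The idea of a character n-gram distribution can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_ngram_distribution` and `l1_distance` are hypothetical, and the L1 distance is only an assumed stand-in for whatever comparison the paper actually uses. The point it shows is that the n-grams are read directly from the character sequence, so no word segmentation or morphological analysis is needed.

```python
from collections import Counter


def char_ngram_distribution(text: str, n: int = 3) -> dict[str, float]:
    """Return the relative frequency of each character n-gram in `text`.

    Hypothetical helper for illustration; the text is treated as a plain
    sequence of characters, with no word boundaries assumed.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        # Text shorter than n characters yields no n-grams.
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def l1_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of absolute frequency differences over all n-grams in p or q.

    An assumed, simple way to compare two distributions; the paper may
    use a different measure.
    """
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

With such a sketch, one distribution could be built per author from known texts, and a disputed text could be compared against each of them; a smaller distance would suggest a closer match under this assumed measure.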
[1] G. Yule. On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to Two Cases of Disputed Authorship, 1939.
[2] J. Springer. A Mechanical Solution of a Literary Problem, 1923.
[3] Matsuura Tsukasa, et al. Identifying Authors of Sentences in Japanese Modern Novels via Distribution of N-grams, 1999.