Statistical study on a literary Romanian corpus for the beginning and ending of the words
The paper investigates the statistical structure of the letters and letter digrams with which words begin and end, as well as of the trigrams that link two successive words. The investigation is carried out on a printed Romanian literary corpus totaling about 12.5 million words. The impact of orthography and punctuation marks on the language model assigned to the beginning and ending of words is also considered.
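As an illustration of the kind of counts the abstract describes, the following is a minimal sketch and not the authors' procedure. It assumes that a word is a maximal run of letters (so hyphenated forms are split and punctuation is discarded), and that a trigram "linking two successive words" is the last letter of one word, a space, and the first letter of the next. The function name `boundary_stats` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def boundary_stats(text):
    """Count letters and digrams at word boundaries, plus linking trigrams.

    Assumption: words are maximal runs of Unicode letters. The paper's own
    tokenization, which takes orthography and punctuation into account,
    may differ from this simplification.
    """
    # [^\W\d_]+ matches runs of letters only (including ă, â, î, ș, ț).
    words = re.findall(r"[^\W\d_]+", text.lower())

    return {
        # First and last letter of every word.
        "first_letters": Counter(w[0] for w in words),
        "last_letters": Counter(w[-1] for w in words),
        # First and last digram of every word with at least two letters.
        "first_digrams": Counter(w[:2] for w in words if len(w) >= 2),
        "last_digrams": Counter(w[-2:] for w in words if len(w) >= 2),
        # Assumed trigram shape: last letter + space + next first letter.
        "link_trigrams": Counter(
            a[-1] + " " + b[0] for a, b in zip(words, words[1:])
        ),
    }


stats = boundary_stats("Ana are mere. Ana merge acasă.")
print(stats["first_letters"].most_common(2))
print(stats["link_trigrams"].most_common(1))
```

Dividing each count by the total of its counter would give relative frequencies, which could serve as probability estimates in a language model of word beginnings and endings.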
[1] Mihai Mitrea, et al. Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included, 2007, ICCSA.
[2] Adriana Vlad, et al. A study on the statistical structure of words and of word digrams in a literary Romanian corpus, 2011, 2011 6th Conference on Speech Technology and Human-Computer Dialogue (SpeD).