Text analysis case study: Determining word frequency based on Azerbaijan's top 500 websites
Word Frequency Distribution (WFD) is one of the most important sub-areas of Natural Language Processing (NLP) and Computational Linguistics. The reliability and quality of WFD results depend heavily on the size and quality of the corpora. This paper describes an ongoing project that aims to build a corpus of Azerbaijani text, AzWebCorpus. The top 500 websites in Azerbaijan are used as the text source for corpus building. Most of the essential tools, including a Web Crawler, Text Cleaner, and Tokenizer, have been developed, and several open-source tools have also been used. Moreover, AzWebCorpus is compared in terms of word frequency with another corpus, AzBookCorpus, which is built on text taken from electronic books. The approach used in this paper is applicable to other languages.
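To make the pipeline above concrete, the following is a minimal Python sketch of the cleaning, tokenizing, and counting steps. It is not the authors' implementation: the function names (`clean_html`, `az_lower`, `tokenize`, `word_frequencies`, `relative_frequencies`) and the regex-based cleaning are assumptions chosen for illustration, and the crawler is left out entirely (pages are assumed to be already downloaded as HTML strings). The one language-specific detail is lowercasing: the Azerbaijani Latin alphabet distinguishes dotted İ/i from dotless I/ı, so the default `str.lower()` would map `I` to the wrong letter.

```python
import re
from collections import Counter
from html import unescape

# Hypothetical helpers for illustration; the paper's own tools are not shown in the source.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
# A word is a run of Unicode letters, optionally joined by an apostrophe or hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")

# Azerbaijani casing: I -> ı (dotless), İ -> i (dotted).
AZ_CASE_MAP = str.maketrans({"I": "ı", "İ": "i"})


def clean_html(raw: str) -> str:
    """Remove script/style blocks and tags, then decode HTML entities."""
    without_code = SCRIPT_STYLE_RE.sub(" ", raw)
    return unescape(TAG_RE.sub(" ", without_code))


def az_lower(word: str) -> str:
    """Lowercase a word using Azerbaijani rules for I and İ."""
    return word.translate(AZ_CASE_MAP).lower()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into lowercase word tokens."""
    return [az_lower(m.group(0)) for m in WORD_RE.finditer(text)]


def word_frequencies(pages: list[str]) -> Counter:
    """Count word occurrences over a collection of raw HTML pages."""
    counts = Counter()
    for raw in pages:
        counts.update(tokenize(clean_html(raw)))
    return counts


def relative_frequencies(counts: Counter) -> dict[str, float]:
    """Normalize counts by corpus size so corpora of different sizes can be compared."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


if __name__ == "__main__":
    sample_pages = [
        "<p>Kitab və KİTAB</p>",
        "<script>var x = 1;</script><div>kitab IŞIQ</div>",
    ]
    freq = word_frequencies(sample_pages)
    for word, n in freq.most_common():
        print(word, n)
```

Comparing two corpora such as AzWebCorpus and AzBookCorpus on raw counts would be misleading when their sizes differ, which is why the sketch includes a normalization step; a production pipeline would more likely use a real HTML parser than regular expressions for cleaning.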