A Statistical Study of the WPT-03 Corpus

This report presents a statistical study of WPT-03, a text corpus built from the pages of the “Portuguese Web” collected in the repository of the tumba! search engine. We analyze the textual content of these pages, including their size distributions, their languages, and the terms they contain.