PrePer: A Pre-processor for Persian

Today, web pages on the World Wide Web are widely used as rich resources for developing corpora. These source materials contain all sorts of texts in various encodings, written by many different authors in various styles. These factors make Persian text processing complex; therefore, before any natural language processing takes place, the input texts need to be prepared and cleaned up into standard texts. A standard text is a text written in standard style, where the internal word boundaries are marked according to the official orthography and style introduced by the Academy of Persian Language and Literature (APLL). In the following sections we give an overview of text processing issues and present our solution for pre-processing Persian texts: PrePer, a pre-processor for Persian.