Trainer beware: corpora for language/encoding identification

Training-based approaches to language processing require corpora. For example, corpora are used for lexicon development, spelling correction, and machine translation. Typically, one wants the corpora to reflect the type of data that the given system will handle. However, real-world data is frequently noisy, and that noise can cause problems for training-based approaches. The question, then, is whether one should "clean up" the data before training and, if so, how much. We have faced this very dilemma in the training and use of language and encoding identification algorithms. We first discuss the problem of language and encoding identification. We then describe the problems faced by our system and our initial attempts at handling them. Finally, we examine the results of this exploration and offer some recommendations for researchers dealing with corpora-