Error Typology for Automatic Proof-reading Purposes

Executive Summary

The error typology is a hierarchically organised classification system for all kinds of language-related errors found in contemporary Swedish newspaper articles. It is to be used in the development of a proofreading tool for Danish, Norwegian, and Swedish in the SCARRIE project. Specifically, the typology forms the basis for the error type code attached to each entry in the Error Corpora Database (ECD), and for the parser in the resulting proofreading system.
In developing the proofreading tool, it is of great importance to know which types of errors actually occur in newspapers, and to have them systematised in an appropriate manner. Potential errors have not been considered: the typology is based solely on errors that actually occurred, not on hypothetical ones. The Swedish newspapers Svenska Dagbladet and Upsala Nya Tidning have supplied material for the development of the error typology and of the ECD, where all error instances are stored together with their corrections and error type codes. The language errors were detected and corrected by professional proofreaders at the newspapers. The typology is descriptive, not normative.
There are at least four possible dimensions along which errors could be divided: the nature of the error, the cause of the error, the context in which the error appears, and the correction of the error. An error must be recognised before it can be corrected, so the erroneous feature and the context are the most important characteristics. The principle is thus that two errors of the same kind appearing in a similar context may be given the same error type code, even if they could be corrected in different ways. The cause of the error has been given the lowest priority: for automatic proofreading purposes, it was found to be of less interest than it would be for pedagogical purposes.
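
The coding principle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the feature and context labels, the dotted code strings, and the two example errors are invented here and are not taken from the ECD or from the actual typology. The sketch only shows the idea that the code depends on the erroneous feature and its context, while the correction and the cause play no part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEntry:
    """A hypothetical ECD-style record (field names are illustrative only)."""
    erroneous_text: str
    correction: str
    feature: str   # nature of the error, e.g. "agreement"
    context: str   # context in which the error appears, e.g. "noun_phrase"


# Invented hierarchical codes, keyed by (feature, context).
# The correction and the cause of the error are deliberately not part of the key.
TYPE_CODES = {
    ("agreement", "noun_phrase"): "GRAMMAR.AGREEMENT.NP",
    ("misspelling", "single_word"): "SPELLING.WORD",
}


def error_type_code(entry: ErrorEntry) -> str:
    """Look up the code from the erroneous feature and its context only."""
    return TYPE_CODES[(entry.feature, entry.context)]


def is_subtype_of(code: str, ancestor: str) -> bool:
    """True if `code` equals `ancestor` or lies below it in the hierarchy."""
    return code == ancestor or code.startswith(ancestor + ".")


# Two invented agreement errors that are corrected in different ways:
# the first by changing the article, the second by changing the adjective.
a = ErrorEntry("en stort hus", "ett stort hus", "agreement", "noun_phrase")
b = ErrorEntry("ett stor hus", "ett stort hus", "agreement", "noun_phrase")

same_code = error_type_code(a) == error_type_code(b)
```

Here `same_code` is true although the two corrections touch different words, and `is_subtype_of(error_type_code(a), "GRAMMAR")` shows how a hierarchical code could be queried at a coarser level.
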
The strategy of the proofreading tool has been taken into consideration in constructing the error typology. The grammar checker will combine linguistic analysis with the application of rules for anticipated errors. Correction will be based on a grammar of foreseen errors. Consistency with regard to standard or style will also be checked. Style checking will concentrate on lexical choice, variation in inflection and, to some extent, syntax. Errors in newspapers may be of many different types. To …