Good Reasons for Noting Bad Grammar: Constructing a Corpus of Ungrammatical Language

The original motivation for compiling the parallel corpora arose when considering the problem of parsing ill-formed language. While it is clear that probabilistic parsers are more successful than traditional non-probabilistic parsers at actually returning an analysis for an ungrammatical sentence, the analysis they return won’t necessarily reflect the sentence’s meaning if they don’t know that sentences can sometimes be ill-formed. A realistic grammar, probabilistic or not, will have a concept of ungrammaticality. Such a concept should be informed by authentic ungrammatical language as opposed to the invented strings often used by linguists. The relationship between the ungrammatical sentences in the first corpus and their grammatical counterparts in the second provides an explicit characterization of the ways in which sentences can become deviant. Not only is this information useful within the practical domain of parsing, it is also useful within linguistics, as a form of evidence for theories of grammar.