Grammatical Analysis by Computer of the Lancaster-Oslo/Bergen (LOB) Corpus of British English Texts

Research has been under way at the unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine readable form.The first phrase of the project, completed in September 1983, produced a grammatically annotated version of the corpus giving a tag showing the word class of each word token. Over 93 per cent of the word tags were correctly selected by using a matrix of tag pair probabilities and this figure was upgraded by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor.The system was originally designed to run in batch mode over the corpus but we have recently modified procedures to run interactively for sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.