Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs that provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form. The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus in which each word token carries a tag showing its word class. Over 93 per cent of the word tags were correctly selected using a matrix of tag-pair probabilities, and this figure was raised by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor. The system was originally designed to run in batch mode over the corpus, but we have recently modified the procedures to run interactively on sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.
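The idea of selecting word tags from a matrix of tag-pair probabilities can be illustrated with a small sketch. This is not the project's actual program: the tag names, the lexicon, the probability values, and the helper names `sequence_score` and `disambiguate` are all invented for illustration, and the sketch simply scores every candidate tag sequence by the product of its tag-pair probabilities and keeps the highest-scoring one.

```python
from itertools import product

# Hypothetical tag-pair probabilities P(tag | previous tag).
# All values are invented for illustration, not taken from the LOB corpus.
TAG_PAIR_PROB = {
    ("START", "AT"): 0.6,
    ("START", "NN"): 0.3,
    ("START", "VB"): 0.1,
    ("AT", "NN"): 0.8,
    ("AT", "VB"): 0.01,
    ("NN", "NN"): 0.2,
    ("NN", "VB"): 0.4,
    ("VB", "NN"): 0.3,
    ("VB", "VB"): 0.05,
}
DEFAULT_PROB = 0.001  # assumed fallback for tag pairs missing from the matrix

# Hypothetical lexicon listing the candidate tags for each word.
LEXICON = {
    "the": ["AT"],          # article
    "dog": ["NN", "VB"],    # noun or verb
    "walks": ["NN", "VB"],  # noun or verb
}


def sequence_score(tags):
    """Multiply the tag-pair probabilities along one candidate tag sequence."""
    score = 1.0
    previous = "START"
    for tag in tags:
        score *= TAG_PAIR_PROB.get((previous, tag), DEFAULT_PROB)
        previous = tag
    return score


def disambiguate(words):
    """Return the candidate tag sequence with the highest score."""
    candidates = [LEXICON[word.lower()] for word in words]
    best = max(product(*candidates), key=sequence_score)
    return list(best)


if __name__ == "__main__":
    print(disambiguate(["the", "dog", "walks"]))
```

With these invented values, the sequence AT NN VB scores 0.6 × 0.8 × 0.4 = 0.192 and beats the three alternatives, so both ambiguous words are resolved from their neighbours alone. Enumerating every sequence is only practical for short ambiguous spans; a dynamic-programming search would be the usual replacement for longer ones.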