In this paper, we propose an architecture, called the UCSG Shallow Parsing Architecture, for building wide-coverage shallow parsers through a judicious combination of linguistic and statistical techniques, without requiring a large parsed training corpus to start with. Only a large POS-tagged corpus is needed. A parsed corpus can then be developed using the architecture with minimal manual effort, and such a corpus can be used for evaluation as well as for performance improvement. The UCSG architecture is designed to be extensible into a full parsing system, but the current work is limited to chunking and obtaining appropriate chunk sequences for a given sentence. In the UCSG architecture, a finite-state grammar is designed to accept all possible chunks, referred to here as word groups. A separate statistical component, encoded in Hidden Markov Models (HMMs), is used to rate and rank the word groups so produced. Note that we are not pruning; we are only rating and ranking the word groups already obtained. A best-first search strategy is then used to produce parse outputs in best-first order, without compromising the ability, in principle, to produce all possible parses. We also propose a bootstrapping strategy for improving the HMM parameters and hence the performance of the parser as a whole. A wide-coverage shallow parser has been implemented for English starting from the British National Corpus, a POS-tagged corpus of nearly 100 million words. Note that this is not a parsed corpus; it also contains tagging errors, many words carry multiple tags, and some words are untagged. A dictionary of 138,000 words, with frequency counts for each word under each tag, has been built. Extensive experiments have been carried out to evaluate the performance of the various modules. We work with large data sets, and the performance obtained is encouraging. A manually checked parsed corpus of 4,000 sentences has also been developed and used to further improve parsing performance. The entire system has been implemented in Perl under Linux.
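To make the pipeline concrete, the sketch below shows one way the best-first enumeration over HMM-rated chunk candidates could be organised. It is a minimal sketch only: the data layout (chunk candidates indexed by start position, with log scores standing in for the HMM ratings), the function name, and the toy candidates are assumptions introduced for illustration, not the paper's actual Perl implementation.

```python
import heapq

def best_first_parses(chunks_by_start, sentence_len, max_parses=5):
    """Enumerate chunk sequences covering the sentence in best-first order.

    chunks_by_start[i] -> list of (log_score, end) candidate chunks starting
    at word position i; the log scores stand in for the HMM ratings.
    (Illustrative sketch, not the paper's actual implementation.)
    """
    # Heap entry: (negated cumulative log score, position reached, chunk path)
    heap = [(0.0, 0, [])]
    parses = []
    while heap and len(parses) < max_parses:
        neg_score, pos, path = heapq.heappop(heap)
        if pos == sentence_len:              # full cover: next-best parse found
            parses.append((-neg_score, path))
            continue
        for log_score, end in chunks_by_start.get(pos, []):
            heapq.heappush(heap, (neg_score - log_score, end, path + [(pos, end)]))
    return parses

# Hypothetical candidates for a 4-word sentence: two competing chunkings.
candidates = {
    0: [(-0.1, 2), (-0.7, 1)],
    1: [(-0.9, 2)],
    2: [(-0.2, 4)],
}
for score, chunks in best_first_parses(candidates, 4):
    print(score, chunks)
```

Because chunk candidates are kept rather than pruned, a search of this kind can in principle enumerate every chunk sequence; the HMM scores only control the order in which the sequences are emitted.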