A Proposal for the Integration of NLP Tools using SGML-Tagged Documents

' In this paper we present the strategy used for an integration, in a common framework, of the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FS) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated until now are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. Due to the complexity of the information to be exchanged among the different tools, FSs are used to represent it. Feature structures are coded following the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The use of SGML for encoding the I/O streams flowing between programs forces us to formally describe the mark-up, and provides software to check that these mark-up hold invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for the communication between the tools has been designed and implemented. It offers the necessary operations to get the information from an SGML document containing FSs, and to produce the corresponding output according to a well-defined FSD.