DEVELOPMENT OF ALGORITHMS AND COMPUTATIONAL GRAMMAR FOR URDU

This work presents a linguistics-based grammar model of the Urdu language under the framework of Lexical Functional Grammar (LFG) and, in places, Head-driven Phrase Structure Grammar (HPSG). The grammar modeling considers two interlinked parts: morphology and syntax. Urdu has a rich verb morphology comprising 60 basic verb forms categorized into infinitive, perfective, repetitive, subjunctive and imperative forms. These 60 forms are not enough to represent all the features of Urdu verbs; further verb features are composed when verb auxiliaries and/or light verbs combine with these forms. Linguistically, verb auxiliaries are expected to combine at the syntactic level. However, this work shows that the grammar model is simplified, and complex agreement requirements can be avoided, if auxiliaries are lumped with verb forms at the lexical level. The work proposes an analysis of the perfective, progressive, repetitive and inceptive aspects as well as of the declarative, permissive, prohibitive, imperative, capacitive, suggestive, compulsive, presumptive and subjunctive moods. The structure of the passive is analyzed by assuming a default argument. Based on differences in grammar modeling and conceptualization, this work classifies Urdu case markers and post-positions into noun forms, core case markers, functional case markers, possession markers and post-positions. Noun forms are modeled morphologically using lexical transducers, possession markers require two noun phrases, post-positions appear as adjuncts, and core and functional case markers appear in the argument structure of verbs. Semantic features are proposed for classifying core and functional case markers. This classification based on semantic features yields, in particular, a better taxonomy of the different 'instrumental cases' in Urdu.
This classification of the 'instrumental case' exposed the presence of 'indirect subjects' for Urdu causative verbs, which in turn suggested that some causative verbs are tetravalent, because their argument structure has four arguments. The study of case markers reveals that the agreement between a noun and a case marker is difficult to handle. It is argued that the head of the phrase should be the noun because the resulting phrase is a noun phrase; however, features of the case marker also transfer to the resulting phrase, and therefore a modification to the head-feature rule is proposed. The same argument also helps to reaffirm that Urdu case markers are different from Urdu possession markers, which require a different rule that takes two noun phrases, as a specifier and a complement, to form the resulting noun phrase. Adjective-noun agreement in gender and number is modeled on the same grounds. The work proposes an algorithm for parsing Urdu sentences based on Urdu closed word classes. This helps in identifying chunks based on the linguistic characteristics of the word classes. Rule selection is simplified because each closed-class word provides a guess of the word class that may appear before or after it. The work also presents a novel roman script for transliterating Urdu, which is not only phonetic like other roman scripts but also makes it possible to convert text between this roman script and the Urdu script, in both directions, using a computer program. This thesis therefore presents novel ideas for the computational grammar of Urdu, which can be utilized in various natural language processing tasks such as machine translation, text summarization, grammar checking and information retrieval.
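
As a rough illustration of chunking on closed word classes, the following minimal Python sketch closes a chunk whenever it meets a closed-class word. It is not the algorithm developed in this work: the lexicon, the tag names (CASE, POSS, POSTP, AUX, OPEN), the function name chunk and the ad hoc romanization are all assumptions made for the example, and the romanization is not the roman script proposed here.

```python
# Hypothetical closed-class lexicon in an ad hoc romanization.
# The entries and tag names are illustrative assumptions only.
CLOSED_CLASS = {
    "ne": "CASE", "ko": "CASE", "se": "CASE",
    "kaa": "POSS", "kii": "POSS", "ke": "POSS",
    "meN": "POSTP", "par": "POSTP",
    "hai": "AUX", "thaa": "AUX",
}


def chunk(tokens):
    """Group tokens into chunks, closing a chunk at each closed-class word.

    Returns a list of (words, tag) pairs, where tag is the class of the
    closed-class word that ended the chunk, or "OPEN" for trailing words
    that were not followed by any closed-class word.
    """
    chunks = []
    current = []
    for tok in tokens:
        tag = CLOSED_CLASS.get(tok)
        if tag is None:
            # Open-class word: keep collecting it into the current chunk.
            current.append(tok)
        else:
            # Closed-class word: it ends the chunk and labels it.
            chunks.append((current + [tok], tag))
            current = []
    if current:
        chunks.append((current, "OPEN"))
    return chunks


if __name__ == "__main__":
    for words, tag in chunk("laRke ne kitaab ko paRhaa".split()):
        print(tag, words)
```

In this sketch the tag of each chunk could then be used to narrow down which grammar rules to try, which is the role the closed word classes play in rule selection as described above.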