Finding structure in language

Since the Chomskian revolution, it has become apparent that natural language is richly struc¬ tured, being naturally represented hierarchically, and requiring complex context sensitive rules to define regularities over these representations. It is widely assumed that the richness of the posited structure has strong nativist implications for mechanisms which might learn natural language, since it seemed unlikely that such structures could be derived directly from the ob¬ servation of linguistic data (Chomsky 1965). This thesis investigates the hypothesis that simple statistics of a large, noisy, unlabelled corpus of natural language can be exploited to discover some of the structure which exists in natural language automatically. The strategy is to initially assume no knowledge of the structures present in natural language, save that they might be found by analysing statistical regularities which pertain between a word and the words which typically surround it in the corpus. To achieve this, various statistical methods are applied to define similarity between statistical distributions, and to infer a structure for a domain given knowledge of the similarities which pertain within it. Using these tools, it is shown that it is possible to form a hierarchical classi¬ fication of many domains, including words in natural language. When this is done, it is shown that all the major syntactic categories can be obtained, and the classification is both relati¬ vely complete, and very much in accord with a standard linguistic conception of how words are classified in natural language. Once this has been done, the categorisation derived is used as the basis of a similar classification of short sequences of words. If these are analysed in a similar way, then several syntactic categories can be derived. These include simple noun phrases, various tensed forms of verbs, and simple prepositional phrases. Once this has been done, the same technique can be applied one level higher, and at this level simple sentences and verb phrases, as well as more complicated noun phrases and prepositional phrases, are shown to be derivable.