On Consensus between Tree-Representations of Linguistic Data

One of the aims of linguistics is to infer, from the ever-growing mass of actual data available, the implicit, virtual organization underlying the apparent disorder and diversity of surface phenomena. The procedure invariably consists in revealing the latent links between the fundamental entities that justify the dichotomy between language and speech: langue and parole (Saussure), competence and performance (Chomsky), system and process, substance and form (Hjelmslev), or sense and signification (Guillaume). This crucial duality is also at work in computational linguistics, where the chief question is how to reach beyond the surface of observed facts to the latent abstract organization, thus giving the observer (i.e. the linguist) access to knowledge that can be generalized. Tree-representation is a powerful means of bringing out the inherent structure of mutually dependent data: a hierarchic tree accounts for the dependence or independence of the represented objects by pairing and embedding clearly outlined categories. Modern linguists, however, are often more interested in the relative closeness of objects than in their membership of this or that closed class. Additive trees, as opposed to hierarchic ones, do away with watertight partitions between objects and lay the stress on notions such as proximity and opposition. The question then arises whether two distinct sets of original data can be represented in a single tree. The only procedure available is to seek a consensus by fusing the two original trees, which is done here by means of a new algorithm. The algorithm is explained in detail and applied to various subsets of the tagged version of the LOB Corpus.
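
The contrast between hierarchic and additive trees can be made concrete through the standard distance conditions associated with each: the ultrametric inequality for hierarchic trees and the four-point condition for additive trees. The following is a minimal sketch only, not the consensus algorithm of the paper. The function names, the labels `a` to `d`, and all distance values are hypothetical and chosen purely for illustration; the four-point check is run on distinct quadruples only and assumes the input is already a symmetric distance table.

```python
from itertools import combinations, permutations

def symmetric(pairs):
    """Build a symmetric distance table from {(x, y): value} (hypothetical helper)."""
    d = {}
    for (x, y), v in pairs.items():
        d.setdefault(x, {})[y] = v
        d.setdefault(y, {})[x] = v
    return d

def is_ultrametric(d, items, tol=1e-9):
    """Hierarchic-tree condition: d(x, z) <= max(d(x, y), d(y, z)) for every triple."""
    return all(
        d[x][z] <= max(d[x][y], d[y][z]) + tol
        for x, y, z in permutations(items, 3)
    )

def is_additive(d, items, tol=1e-9):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for w, x, y, z in combinations(items, 4):
        sums = sorted([d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y]])
        if sums[2] - sums[1] > tol:
            return False
    return True

items = ["a", "b", "c", "d"]

# Toy path lengths in a tree with unequal branch lengths (invented values).
additive_d = symmetric({
    ("a", "b"): 3, ("c", "d"): 4,
    ("a", "c"): 4, ("a", "d"): 6,
    ("b", "c"): 5, ("b", "d"): 7,
})

# Toy dendrogram heights: two tight pairs merged at a common level (invented values).
hier_d = symmetric({
    ("a", "b"): 2, ("c", "d"): 2,
    ("a", "c"): 4, ("a", "d"): 4,
    ("b", "c"): 4, ("b", "d"): 4,
})

print(is_additive(additive_d, items), is_ultrametric(additive_d, items))
print(is_additive(hier_d, items), is_ultrametric(hier_d, items))
```

In this toy setting the first table satisfies the four-point condition but not the ultrametric inequality, whereas the second satisfies both. This reflects the fact that every hierarchic tree distance is also additive, while an additive tree can express graded proximity that a strict class hierarchy cannot.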