Structuring Taxonomies from Texts: A Case-Study on Defining Soil Classes

Most digitally available information is presented in textual form, and it is widely acknowledged that, in many fields, the advance of knowledge may benefit strongly from this source. Processing this vast amount of text with Text Mining (TM) techniques has produced useful information in fields such as Competitive Intelligence and Bibliometrics, which need to make sense of textual descriptions of facts. In this paper we approach the problem of taxonomy generation from texts, a need common to many scientific disciplines. Taxonomy generation refers to building a hierarchical structure that organizes the concepts of a knowledge domain. We applied TM techniques to help Pedology experts build a taxonomy from redundant soil descriptions. The motivation for the application is that, in the early 1980s, different organizations mapped and described equivalent classes of soils from the Brazilian savannas, generating redundant descriptions with different class labels. In total, 28 soil maps were produced, covering 4,101 descriptions of soil classes. This profusion of redundant soil descriptions amounts to a Tower of Babel that hinders tasks such as environmental management and food production. The proposed process is based on cluster analysis and runs on the soil descriptions, successively refining the abstractions found in them. The method builds a frame that shows, for each cluster formed, the prototype (a representative word vector) and the soil descriptions related to that cluster. The results have been analyzed by a team of experts as input to the laborious reasoning process involved in building concepts from the semantic relations among the soil descriptions. Without support such as the present process, the experts would have to visually compare at least 4,101 × 4,100 × … × 1 soil descriptions to define the clusters, which is far more laborious.
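To make the idea of a cluster frame concrete, the following is a minimal sketch and not the procedure actually used in this work. It assumes bag-of-words vectors, cosine similarity, and a simple k-means-style loop, none of which are specified in the text above. The class labels, the descriptions, and the function names (`vectorize`, `kmeans`, `build_frame`) are invented for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one soil description."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Prototype of a cluster: the mean term weight over its members."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})

def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its most similar prototype, then recompute
    the prototypes, until the assignment stops changing."""
    prototypes = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(range(len(prototypes)), key=lambda k: cosine(v, prototypes[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(prototypes)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                prototypes[k] = centroid(members)
    return assignment, prototypes

def build_frame(descriptions, seeds):
    """For each cluster, return its prototype and its member class labels."""
    labels = list(descriptions)
    vectors = [vectorize(descriptions[label]) for label in labels]
    assignment, prototypes = kmeans(vectors, seeds)
    return [
        {"prototype": proto,
         "members": [labels[i] for i, a in enumerate(assignment) if a == k]}
        for k, proto in enumerate(prototypes)
    ]

# Hypothetical labels and descriptions, not taken from the 28 soil maps.
descriptions = {
    "LEa1": "dark red latosol clayey texture flat relief",
    "LV3": "red latosol clayey texture gently undulating relief",
    "AQ2": "quartz sand deep sandy texture flat relief",
    "AQd": "quartz sand sandy texture undulating relief",
}

frame = build_frame(descriptions, seeds=[0, 2])
for cluster in frame:
    top_terms = [term for term, _ in cluster["prototype"].most_common(3)]
    print(cluster["members"], top_terms)
```

Under these assumptions, each frame entry pairs a prototype with the labels of the descriptions grouped under it, which is the kind of summary an expert could inspect instead of comparing descriptions one by one. Applying `build_frame` again to the members of a single cluster would be one possible way to imitate successive refinement.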