A Flexible Workbench for Document Analysis and Text Mining

Document analysis and text mining techniques are used to pre-process documents in information retrieval systems, to extract concepts in ontology construction processes, and to discover and classify knowledge along several dimensions. In most cases it is not obvious how the techniques should be configured and combined, and setting up and testing various combinations of techniques is time-consuming. In this paper, we present a workbench that makes it easy to plug in new document analysis and text mining techniques and to experiment with different combinations of them. We explain the architecture of the workbench and show how it has been used to extract ontological concepts and relationships from a document collection published by the Norwegian Center for Medical Informatics.
